INTRODUCTION


In this manuscript some very basic ideas of important consequence
are discussed. These ideas are important for any practicing engineer
in pattern recognition. The topics include equivalent classifiers,
dimensionality reduction, fusion of classifiers, time-varying
statistics, etc.

Throughout this presentation it is assumed that the reader is
familiar with the mechanics of constructing discriminants, selecting
features, and other related properties. Therefore no attempt is made
to make this presentation comprehensive. Most of the subjects
discussed here are considered 'obvious' in standard books on pattern
recognition. It is our belief that readers of this manuscript will
benefit considerably from giving some time to these "obvious"
results, mainly because the obvious results are sometimes very
confusing.




1.  OPTIMAL AND EQUIVALENT DISCRIMINANTS

Many important applications of pattern recognition can be
characterized as either waveform classification or classification of
geometric figures. To perform this type of classification, one
typically measures some observable characteristics of the object.
This collection of measurements is called the features, and the
process of deriving the features is called feature extraction.
Typically a classifier is then developed using these features.

In any classification problem, one of the basic assumptions is that
there exists some difference between the populations from which the
objects are sampled. Thus, there is always a classifier which can be
used to differentiate between the populations.

We will call it "the natural classifier" and denote it by C. The
existence of such a natural classifier is of fundamental importance
in pattern recognition. It will also be useful in the following
discussion.

To fix the ideas, we consider the example of character recognition
between the letters A and B. Note that there exists a natural
classifier (which the human mind employs) to distinguish between A
and B. For mechanical or computerized discrimination one would
select features to construct a classifier. This feature selection
can be done in many ways, and the success of the corresponding
classifier depends very heavily on these features. For the
handwritten characters A and B, two possible feature extraction
procedures are:

(a)  Put a standard grid on each letter and measure the shaded area
in each cell [see attached figure].

(b)  Record the presence or absence of a portion of the letter in
each cell by 1 or 0 respectively, and obtain a feature vector
consisting of 0's and 1's.

The collection of these features can be employed to construct two
respective classifiers.


Mathematically, feature extraction is equivalent to a transformation
from the natural space, S, to the Euclidean space $R^n$, i.e.

    $\{F\} = T[S]$

where $\{F\}$ denotes the new features and T is the transformation.
Typically the transformation T is nonlinear, and sometimes one may
not be able to express it in terms of mathematical equations.

Let C and C' denote the classifiers in the natural space and the
feature space respectively, and $T^{-1}$ the inverse transformation
of T. If $T^{-1}$ exists, then C and C' will be equivalent.

This equivalence is obvious because the existence of $T^{-1}$ implies
that there is a one-to-one transformation from S to the space of
features and conversely.

Extending this idea, suppose that $T_1$ and $T_2$ are two
transformations and $\{F_1\}$ and $\{F_2\}$ the corresponding
features. Then the classifiers $C_1$ and $C_2$ would be equivalent to
each other and to C if $T_1^{-1}$ and $T_2^{-1}$ exist. Obviously
$C_1$ and $C_2$ could be equivalent to each other, if there exists a
one-to-one transformation from $\{F_1\}$ to $\{F_2\}$, without being
equivalent to C.

Thus, the optimal classifier is the 'natural' classifier and,
generally, it is not possible to explain how it works. On the other
hand, to obtain a classification procedure a set of features is
obtained. Given these features one can attempt to obtain optimal
classification procedures. But it must be remembered that this
optimality is conditional upon the given feature set. In other
words, if a new feature set is given then another 'optimal'
classifier will be obtained. The two 'optimal' classifiers will be
equivalent if and only if it is possible to obtain a one-to-one
transformation from one feature set to the other.

To summarize, a classifier is optimal only after a set of features
has been selected. This optimality should not be confused with
'global' optimality.


2.  ON NUMBER OF FEATURES


In a typical pattern recognition problem there are two stages:
(a) the feature selection stage and (b) the design of a classifier
based on the selected features. Classifier design is relatively
easier in the sense that if the features and their class-dependent
joint distributions are available, then one can apply the Bayes
procedure to obtain the optimal classifier. In case the
class-dependent distributions are partially known, or even if they
are completely unknown, modifications of the optimal classifier or
nonparametric classifiers are applicable. On the other hand, the
problem of feature selection is quite difficult because no standard
procedures can be applied and, moreover, the features are specific
to the problem under consideration.

The problem of feature selection arises generally because the data
collected in the natural space is not suitable for mathematical
manipulation. For example, consider computerized classification of
ECG curves into one of several disease classes. In this case
mathematical manipulation of the random ECG curves is almost
impossible; hence the need for alternative ways of storing the
information in a curve. For this particular problem, one possible
procedure is to apply the Karhunen-Loeve expansion.

Feature selection also plays an important role as a method of data
reduction. For example, although the data may be available in vector
form, suitable for mathematical manipulation, its dimensionality may
be very large. In such a situation it is desired to compress the
dimensionality without sacrificing performance.

Let $C_n$ be a classifier based on n features, and let $C_m$ be a
classifier based on a subset of m of the original n features. It
appears to be a well-known property that the performance of $C_m$
cannot be superior to that of $C_n$. A proof of this property is
easily obtained in the case of a two-class classification problem
with underlying normal distributions having a common covariance
matrix.

In this case the performance of the optimal classifier is measured
in terms of the Mahalanobis distance $\delta' \Sigma^{-1} \delta$,
where $\delta = \mu_1 - \mu_2$, $\Sigma$ is the common covariance
matrix, and $\mu_i$ is the mean vector of class i, i = 1, 2. The
error probability decreases as the Mahalanobis distance increases,
because the error probability is given by
$\Phi[-\frac{1}{2} (\delta' \Sigma^{-1} \delta)^{1/2}]$. Partition
$\delta = (\delta_1', \delta_2')'$ and $\Sigma$ conformably, with
$\delta_1$ and $\Sigma_{11}$ corresponding to the selected subset.
Since

    $\delta' \Sigma^{-1} \delta = \delta_1' \Sigma_{11}^{-1} \delta_1 + \delta_{2.1}' \Sigma_{22.1}^{-1} \delta_{2.1}$

where

    $\delta_{2.1} = \delta_2 - \Sigma_{21} \Sigma_{11}^{-1} \delta_1$,    $\Sigma_{22.1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}$

and

    $\delta_{2.1}' \Sigma_{22.1}^{-1} \delta_{2.1} \ge 0$,

it follows immediately that
$\delta' \Sigma^{-1} \delta \ge \delta_1' \Sigma_{11}^{-1} \delta_1$.
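
The decomposition of the Mahalanobis distance used in this argument
can be checked numerically. The following is only a minimal sketch
for two features, where the "subset" is the first feature alone; the
mean difference, the covariance matrix and the helper name
mahalanobis_sq_2d are illustrative assumptions, not values from the
text.

```python
# Sketch: verify that the full Mahalanobis distance equals the distance on
# the subset (first feature) plus a non-negative remainder.
# All numbers below are illustrative assumptions.

def mahalanobis_sq_2d(delta, sigma):
    """delta' Sigma^{-1} delta for a 2-vector and a 2x2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    d1, d2 = delta
    # Explicit inverse of the 2x2 matrix applied on both sides.
    return (d * d1 * d1 - (b + c) * d1 * d2 + a * d2 * d2) / det

delta = (1.0, 0.5)                      # assumed difference of class means
sigma = ((2.0, 0.6), (0.6, 1.0))        # assumed common covariance matrix

full = mahalanobis_sq_2d(delta, sigma)

# Distance using the first feature only: delta_1' Sigma_11^{-1} delta_1.
subset = delta[0] ** 2 / sigma[0][0]

# Remainder term: delta_{2.1}' Sigma_{22.1}^{-1} delta_{2.1}.
delta_21 = delta[1] - sigma[1][0] / sigma[0][0] * delta[0]
sigma_221 = sigma[1][1] - sigma[1][0] * sigma[0][1] / sigma[0][0]
remainder = delta_21 ** 2 / sigma_221

print(full, subset + remainder)
```

Because the remainder is a squared quantity divided by a positive
conditional variance, it can never be negative, which is the whole
content of the inequality.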

Thus, selecting a subset cannot lead to a better classifier. But
this does not imply that if n > m, and the first classifier is based
on n features and the second on m features, then the first
classifier is necessarily better than the second. In fact, in some
cases a classifier based on n features may do worse than another
classifier based on m features (m < n).

To demonstrate this property, consider the following trivial
example. Consider two populations in which the underlying random
vector is (k+1)-dimensional, k > 1. Suppose the marginal
distribution of the first k components is identical in the two
populations, so that the first k components have no discriminatory
capability. On the other hand, the (k+1)th component has different
distributions in the two populations. Two researchers, who are
unaware of this property, select feature sets consisting of the
first k components and of the (k+1)th component only, respectively.
It is obvious that the first researcher will obtain a poorer
discriminant, although his feature set contains a larger number of
components than that of the second researcher.

In general, for every classifier based on m features, one can
produce an equivalent classifier with n features, where $n \ge m$,
because all we need to do is add n-m non-informative independent
features to the set of m features. On the other hand, given a
classifier based on m features, one can produce an equivalent
classifier based on one feature alone, as seen below.

The existence of a one-dimensional equivalent criterion is easily
seen in the case of a two-class problem when the underlying
distributions are normal with common covariance $\Sigma$. In this
case, the classification rule based on the n features x is given by:

    classify x to class 1 iff  $(x - \frac{\mu_1 + \mu_2}{2})' \Sigma^{-1} (\mu_1 - \mu_2) > 0$

where standard notations are employed. Choosing the one-dimensional
feature Y, where

    $Y = x' \Sigma^{-1} (\mu_1 - \mu_2)$,

we obtain an equivalent classifier. The result is now obvious for
$m \ge 2$. In general, for any arbitrary distributions, let

    $Y = f_1(x) / f_2(x)$

where $f_i(x)$ is the probability density function corresponding to
the ith class.
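
As an illustration of the equivalence in the normal case, the
following minimal sketch compares the rule on two features with the
rule on the single feature Y. The means, the covariance matrix, the
test points and the helper names are assumptions made only for this
example.

```python
# Sketch: the two-feature linear rule and the rule based on the single
# feature Y = x' Sigma^{-1} (mu1 - mu2) give identical decisions.
# Means, covariance and test points are illustrative assumptions.

def solve_2x2(sigma, v):
    """Return Sigma^{-1} v for a 2x2 matrix Sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    return ((d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

mu1, mu2 = (2.0, 6.0), (4.0, 9.0)
sigma = ((2.0, 1.0), (1.0, 2.0))

# Weight vector Sigma^{-1} (mu1 - mu2) and the midpoint of the two means.
w = solve_2x2(sigma, (mu1[0] - mu2[0], mu1[1] - mu2[1]))
midpoint = ((mu1[0] + mu2[0]) / 2, (mu1[1] + mu2[1]) / 2)
threshold = dot(midpoint, w)   # value of Y at the midpoint

def classify_full(x):
    """Class 1 iff (x - midpoint)' Sigma^{-1} (mu1 - mu2) > 0."""
    centered = (x[0] - midpoint[0], x[1] - midpoint[1])
    return 1 if dot(centered, w) > 0 else 2

def classify_one_feature(y):
    """Class 1 iff the one-dimensional feature Y exceeds the threshold."""
    return 1 if y > threshold else 2

points = [(1.0, 5.0), (3.0, 8.0), (5.0, 9.0), (2.5, 7.0), (4.0, 6.0)]
decisions_full = [classify_full(x) for x in points]
decisions_one = [classify_one_feature(dot(x, w)) for x in points]
print(decisions_full, decisions_one)
```

The second rule never looks at x itself, only at the scalar Y, yet
it reproduces every decision of the first.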

In summary, it cannot be said that a discriminant based on a larger
number of features is necessarily better than another discriminant
which uses a smaller number of features, unless the second set of
features is a subset of the first set. Additional features will
improve the performance of a discriminant only if they are
informative. Finally, the performance of a discriminant depends not
on the number of features but on the choice of the features
themselves.


3.  DISCRIMINATION  VERSUS  CLASSIFICATION 


There are two main goals in pattern recognition. The first goal
could be called "separating distinct sets of objects" and the second
goal is to "allocate new items to previously defined groups". Fisher
(1938) used the term "discrimination" to refer to the first goal. A
more descriptive term is "separation". The second goal is referred
to as "classification", which is also called "allocation" [see
Johnson and Wichern (1980)] and "identification" [see Rao (1974)].
These concepts are further explained below.

By discrimination or separation we understand how to describe,
either graphically or algebraically, the differential features of
objects (observations) from several known collections (populations).
We try to find "discriminants" whose values are such that the
collections are separated as much as possible.

By classification or allocation we understand how to sort objects
(observations) into 2 or more labeled classes. The emphasis is on
deriving a rule which can be used to "optimally" assign a new object
to the labeled classes.

The difference, just pointed out, between discrimination and
classification is generally not explained in standard texts on
pattern recognition. Inconsistent use of the terminology by
statisticians and pattern recognition practitioners has also caused
confusion. Moreover, a function which separates may also be used for
allocation and, conversely, an allocatory rule may suggest a
discriminatory procedure. Thus in practice the two goals may overlap
and the distinction between separation and allocation becomes
blurred.

Allocation or classification rules are usually developed from
"learning" samples. Observations are randomly selected and are known
to come from specified populations. These samples, also known as the
training set, are then examined for differences and, based on the
results of this examination, the entire sample space is partitioned
into as many regions as the number of populations.

If we denote these disjoint and exhaustive regions by
$R_1, R_2, \ldots, R_p$, where p = number of populations, and if a
new observation falls in the region $R_i$, it is allocated to the
ith population.

Fisher's idea, in discriminating between two populations $\pi_1$
and $\pi_2$ on the basis of observed values of presumably relevant
variables x, was to transform the multivariate observations x to
univariate observations y such that the y's derived from populations
$\pi_1$ and $\pi_2$ were separated as much as possible. For
simplicity, Fisher suggested the use of linear combinations of x to
create the y's. This idea can be extended to several classes and
also to several discriminants $y_1, y_2, \ldots$, where $y_1$
provides the best separation, $y_2$ the next best separation, and so
on. It is well known that these discriminants have also been used
for classification and have "optimum" properties for the normal
distributions.

One of the most important issues, after a classifier has been
designed based on a training set, is to find how well it performs.
Considerable attention has been paid to this problem, mostly for the
two-class problem assuming the underlying distributions are normal.

Denoting the probability of misclassification by p, there are
several types of error probabilities which should be distinguished.

    $p$ :          when the discriminant uses the known population
                   parameters and is applied to independent
                   observations from the population;

    $\hat{p}$ :    when the discriminant is based on a training set
                   and its performance is measured using the given
                   training set;

    $p^*$ :        when the discriminant is based on the training
                   set and its performance is measured on another
                   independent set called the test set;

    $\tilde{p}$ :  when the discriminant is based on a training set
                   and its performance is measured on independent
                   samples from the population.

A general result

    $\hat{p} < p < \tilde{p}$

was established by Mills in 1965. The dependence of $\hat{p}$ and
$\tilde{p}$ on the ratio n/k, where n is the size of the training
set and k is the dimension of the underlying normal random variable,
was studied by Foley (1972) when $\Sigma$, the common covariance
matrix, is assumed known. Foley observed that the difference between
$E(\hat{p})$ and p is very large if $n/k \ll 3$. Only if n/k > 3 is
$E(\hat{p}) - p$ small, so that $\hat{p}$ can be considered a
reasonable estimator of p. Mehrotra (1973) observed that if $\Sigma$
is also estimated, then $\hat{p}$ can be considered a good estimator
only if n/k > 5.

However, obtaining the estimate of p is the most important problem,
and its distribution is very complex. Asymptotic results have been
obtained by several investigators. Lachenbruch and Mickey (1965)
studied several possible estimators of p for the normal distribution
and concluded that the leave-one-out method, which is equivalent to
jackknifing the estimator $\hat{p}$, provides a good estimator of p.
This work was further studied by Cochran (1968). Space
considerations prohibit going into the details of work in this area.
Toussaint's (1974) bibliography provides useful references related
to this problem.
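
To make the difference between measuring performance on the training
set and the leave-one-out method concrete, here is a minimal sketch.
It assumes a one-dimensional nearest-mean classifier and a small
made-up training set; neither is taken from the studies cited above,
and the function names are hypothetical.

```python
# Sketch: training-set (resubstitution) versus leave-one-out error
# estimates for a nearest-mean classifier on one-dimensional data.
# The data and the choice of classifier are illustrative assumptions.

def mean(values):
    return sum(values) / len(values)

def nearest_mean_label(x, class1, class2):
    """Assign x to the class whose sample mean is closer (ties go to class 1)."""
    return 1 if abs(x - mean(class1)) <= abs(x - mean(class2)) else 2

def resubstitution_error(class1, class2):
    """Error rate when the training set is reused as the test set."""
    errors = sum(nearest_mean_label(x, class1, class2) != 1 for x in class1)
    errors += sum(nearest_mean_label(x, class1, class2) != 2 for x in class2)
    return errors / (len(class1) + len(class2))

def leave_one_out_error(class1, class2):
    """Each observation is classified by a rule designed without it."""
    errors = 0
    for i, x in enumerate(class1):
        rest = class1[:i] + class1[i + 1:]
        errors += nearest_mean_label(x, rest, class2) != 1
    for i, x in enumerate(class2):
        rest = class2[:i] + class2[i + 1:]
        errors += nearest_mean_label(x, class1, rest) != 2
    return errors / (len(class1) + len(class2))

class1 = [1.0, 2.0, 3.0, 4.4]
class2 = [4.6, 6.0, 7.0, 8.0]

p_resub = resubstitution_error(class1, class2)
p_loo = leave_one_out_error(class1, class2)
print(p_resub, p_loo)
```

In this toy data set the two observations near the boundary are
classified correctly only because they helped to place the boundary;
once each is left out of the design, it is misclassified, so the
leave-one-out estimate is larger than the training-set estimate.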

Several studies have also been performed on the performance of
Fisher's linear discriminant. These include studies of its
performance when the common covariance assumption is not applicable
and when the underlying distributions are not normal. Most of these
studies are empirical. The overall performance of Fisher's linear
discriminant is found to be satisfactory.

In the study of the Fisher linear discriminant, other problems of
interest are: (i) the study of the coefficients of the Fisher linear
discriminant and (ii) the problem of testing the significance of the
obtained discriminant function. Sitgreaves (1961) observed that the
estimates of the coefficients in the linear discriminant are biased
and obtained the bias. Nanda (1949) has shown that as the sample
size increases the standard errors of the estimates of the
coefficients decrease but do not converge to zero. Using these and
other similar results one can construct confidence intervals and
test hypotheses regarding these estimates. Of particular interest is
the hypothesis of whether or not a certain coefficient is zero.

The second problem, namely the testing of the significance of the
obtained discriminant function, was considered by Fisher by means of
developing a test for $D^2$, the Mahalanobis distance. Rao (1946,
1948) obtained a test based on the distributional property of

    $\frac{(n_1 + n_2 - k - 1)\, n_1 n_2}{(n_1 + n_2)(n_1 + n_2 - 2)\, k} (\bar{x}_1 - \bar{x}_2)' S^{-1} (\bar{x}_1 - \bar{x}_2)$

which is distributed as $F(k, n_1 + n_2 - k - 1)$. In the above
expression $n_1, n_2$ are the sample sizes, $\bar{x}_1, \bar{x}_2$
are the sample means, $S$ is the estimate of the common covariance
matrix, and k is the dimensionality of the underlying random
variable.
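
The statistic above is straightforward to evaluate. The sketch below
does so for k = 2; the sample sizes, sample means and covariance
estimate are assumed values chosen for illustration, and the function
names are hypothetical.

```python
# Sketch: evaluate the F statistic for testing the significance of the
# discriminant function when k = 2.
# Sample sizes, means and the covariance estimate are assumptions.

def quadratic_form_inv_2x2(d, s):
    """d' S^{-1} d for a 2-vector d and a 2x2 matrix S."""
    (a, b), (c, e) = s
    det = a * e - b * c
    return (e * d[0] ** 2 - (b + c) * d[0] * d[1] + a * d[1] ** 2) / det

def rao_f_statistic(n1, n2, mean1, mean2, cov):
    """Return the statistic and its (numerator, denominator) degrees of freedom."""
    k = len(mean1)
    diff = (mean1[0] - mean2[0], mean1[1] - mean2[1])
    d2 = quadratic_form_inv_2x2(diff, cov)       # sample Mahalanobis distance
    factor = (n1 + n2 - k - 1) * n1 * n2 / ((n1 + n2) * (n1 + n2 - 2) * k)
    return factor * d2, (k, n1 + n2 - k - 1)

f_value, dof = rao_f_statistic(
    10, 12, (2.0, 6.0), (4.0, 9.0), ((2.0, 1.0), (1.0, 2.0))
)
print(f_value, dof)
```

The resulting value would then be compared with a tabulated
percentage point of the F distribution with the returned degrees of
freedom.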


5.  SEQUENTIAL VS. NONSEQUENTIAL CLASSIFICATION PROCEDURES WITH
    SEVERAL POPULATIONS


For simplicity of presentation, we consider the case of 3
populations denoted by $\pi_1$, $\pi_2$ and $\pi_3$. Given an
observation x, we wish to classify it to one of these three
populations.

Let $f_i(x)$ be the density associated with population $\pi_i$,
i = 1, 2, 3, $p_i$ the prior probability of population $\pi_i$, and
C(k|i) the cost of allocating an item to $\pi_k$ when it belongs to
$\pi_i$, for i, k = 1, 2, 3. The Bayes classification rule, which
minimizes the expected cost of misclassification, is given as
follows. The observation x is classified to the population $\pi_k$,
k = 1, 2, 3, for which

    $\sum_{i=1,\, i \ne k}^{3} p_i f_i(x)\, C(k|i)$        (5.1)

is smallest.

If all the misclassification costs are equal, then the term in
(5.1) will be smallest when the omitted term, $p_k f_k(x)$, is
largest. Thus, for equal costs of misclassification, the observation
x is classified

    to $\pi_1$      if  $p_1 f_1(x) \ge \max\{p_2 f_2(x),\, p_3 f_3(x)\}$,
    to $\pi_2$      if  $p_2 f_2(x) \ge \max\{p_1 f_1(x),\, p_3 f_3(x)\}$,      (5.2)
    and to $\pi_3$  if  $p_3 f_3(x) \ge \max\{p_1 f_1(x),\, p_2 f_2(x)\}$.

A second classification rule is obtained by comparing two
populations at a time. If the costs of misclassification are equal
then this rule is: classify x to $\pi_1$ if
$p_1 f_1(x) > p_2 f_2(x)$ and $p_1 f_1(x) > p_3 f_3(x)$, or
equivalently if $p_1 f_1(x) - p_2 f_2(x) > 0$ and
$p_1 f_1(x) - p_3 f_3(x) > 0$. The other two cases can be described
in a similar manner. This procedure, which is alternatively written
as (5.3) below, is equivalent to (5.2) described earlier.

Allocate x to $\pi_k$ if

    $\frac{f_k(x)}{f_i(x)} \ge \frac{p_i}{p_k}$  for all i = 1, 2, 3.        (5.3)

Note that in (5.3) one obtains the same inequality which is obtained
in the case of two-population classification, with the major
difference that the desired inequality should be satisfied for all
three possible values of i.

One may alternatively decide to follow a third procedure, described
below, which is sequential in nature. First allocate x to $\pi_1$ or
to ($\pi_2$ or $\pi_3$) by the Bayes rule. If the decision is to
allocate x to $\pi_2$ or $\pi_3$, then in the second stage allocate
it to one of these two populations by again using the Bayes rule.
Note that this third procedure is not equivalent to the Bayes rule
described above. This can be easily seen by means of an example.
Consider the case of three univariate normal populations with common
variance 1 and respective means 3, 5 and 6. Let the a priori
probabilities be all equal to 1/3 and the costs of misclassification
be also all equal.

In this case, using procedure (5.2) the boundaries are obtained at 4
and 5.5. That is, if x < 4 classify it to population $\pi_1$ (with
mean 3), if $4 \le x < 5.5$ classify it to population $\pi_2$ (with
mean 5), and if $x \ge 5.5$ classify it to population $\pi_3$. On
the other hand, the third procedure described above will allocate x
to $\pi_1$ if

    $p_1 f_1(x) \ge p_2 f_2(x) + p_3 f_3(x)$

and to ($\pi_2$ or $\pi_3$) otherwise. In this case the boundary is
given by 3.9075. In other words, if x < 3.9075 then it is classified
to $\pi_1$, otherwise to $\pi_2$ or $\pi_3$; as before, the boundary
between $\pi_2$ and $\pi_3$ is 5.5.

In summary, the two alternative procedures are different, and
clearly the first procedure, comparing two populations at a time,
is optimum.
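
The boundaries quoted in this example can be reproduced numerically.
The following minimal sketch uses the means, variance and priors of
the example; the bisection helper, its tolerance and the search
intervals are assumptions of the sketch.

```python
import math

# Sketch: decision boundaries for three univariate normal populations with
# unit variance, means 3, 5, 6 and equal priors (the example in the text).
# The bisection helper and its tolerance are assumptions of this sketch.

MEANS = (3.0, 5.0, 6.0)
PRIORS = (1 / 3, 1 / 3, 1 / 3)

def weighted_density(x, i):
    """p_i f_i(x) for the unit-variance normal population with index i."""
    return PRIORS[i] * math.exp(-0.5 * (x - MEANS[i]) ** 2) / math.sqrt(2 * math.pi)

def bisect_root(g, lo, hi, tol=1e-10):
    """Root of g on [lo, hi], assuming g changes sign exactly once there."""
    g_lo = g(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (g(mid) > 0) == (g_lo > 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Boundaries of the rule that compares two populations at a time.
b12 = bisect_root(lambda x: weighted_density(x, 0) - weighted_density(x, 1), 3.0, 5.0)
b23 = bisect_root(lambda x: weighted_density(x, 1) - weighted_density(x, 2), 5.0, 6.0)

# First stage of the sequential rule: pi_1 against (pi_2 or pi_3) pooled.
b_seq = bisect_root(
    lambda x: weighted_density(x, 0) - weighted_density(x, 1) - weighted_density(x, 2),
    3.0, 5.0,
)
print(b12, b23, b_seq)
```

Observations falling between the first-stage boundary of the
sequential rule and 4 are exactly those on which the two procedures
disagree: the sequential rule sends them on to the second stage
although $\pi_1$ has the largest value of $p_i f_i(x)$ there.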


6.  ON  THE  POSSIBILITY  OF  AN  UNKNOWN  GROUP 


Typically, in a classification problem it is assumed that there are
a specified number k of classes and the objective is to classify a
new observation into one of these k classes. In a more realistic
situation it is possible that a given observation may not belong to
any one of the given k classes. In other words, there exists the
possibility of the new object belonging to an unknown class, a class
not previously specified. Thus, in such a situation we encounter two
problems: (a) classify the given object into one of the k given
classes, and (b) show that there exists another class to which the
new object belongs.

This problem has not been considered in great detail in the
literature. This is because the class conditional density, the prior
probability, etc. are all unknown for this unknown class. However,
in one particular situation the problem can be considered, as seen
below (Rao, 1974).

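Let $N(\mu, \Sigma)$ denote the normal density of a p-dimensional
random vector with mean vector $\mu$ and covariance matrix $\Sigma$.
Let $N(\mu_1, \Sigma)$ and $N(\mu_2, \Sigma)$ be two class
conditional densities and let $\mu_1$, $\mu_2$ and $\Sigma$ be known
(or estimated from very large training sets). Consider the
well-known Fisher linear classifier. For classifying a new
observation x, it is given by

    $x' \Sigma^{-1} (\mu_1 - \mu_2) - \frac{1}{2} (\mu_1 + \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2) \ge 0$

or equivalently by

    $(x - \mu_1)' \Sigma^{-1} (\mu_1 - \mu_2) + \frac{1}{2} (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2) \ge 0$.

Let us now consider normal densities with covariance matrix $\Sigma$
and mean vector given by $\lambda \mu_1 + (1 - \lambda) \mu_2$. This
density is given by

    $\frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\{-\frac{1}{2} (x - \lambda\mu_1 - (1-\lambda)\mu_2)' \Sigma^{-1} (x - \lambda\mu_1 - (1-\lambda)\mu_2)\}$

    $= \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\{-\frac{1}{2} (x - \mu_1 + (1-\lambda)(\mu_1 - \mu_2))' \Sigma^{-1} (x - \mu_1 + (1-\lambda)(\mu_1 - \mu_2))\}$

    $= \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\{-\frac{1}{2} [(x - \mu_1)' \Sigma^{-1} (x - \mu_1) + 2(1-\lambda)(\mu_1 - \mu_2)' \Sigma^{-1} (x - \mu_1) + (1-\lambda)^2 (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2)]\}$.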

In the above representation, using the Neyman-Fisher factorization
theorem, it is clear that $(x - \mu_1)' \Sigma^{-1} (\mu_1 - \mu_2)$
is a sufficient statistic for the unknown parameter $\lambda$. As a
consequence of this sufficiency property, it suffices to know
$Y = (x - \mu_1)' \Sigma^{-1} (\mu_1 - \mu_2)$ to draw statistical
inference regarding all normal populations with means lying on the
straight line joining $\mu_1$ and $\mu_2$.

Now consider the following problem. Given a new observation x and
two classes with densities $N(\mu_1, \Sigma)$ and $N(\mu_2, \Sigma)$,
the problem is to classify x into one of these two classes. But our
above result implies that Y is sufficient for $\lambda$ and,
therefore, it should be possible to test whether x belongs to a
class which has the normal distribution with mean
$\lambda\mu_1 + (1-\lambda)\mu_2$ and covariance $\Sigma$. The idea
is that if x does not belong to this class, then it belongs to
neither of the two specified classes. This leads to testing the
hypothesis

    $H_0$:  mean = $\lambda\mu_1 + (1-\lambda)\mu_2$,  $\lambda$ unknown

versus

    $H_1$:  $H_0$ not true.


A test for this hypothesis is given by T > C, where

    $T = (x - \mu_1)' \Sigma^{-1} (x - \mu_1) - \frac{[(x - \mu_1)' \Sigma^{-1} (\mu_2 - \mu_1)]^2}{(\mu_2 - \mu_1)' \Sigma^{-1} (\mu_2 - \mu_1)}$

and, under $H_0$, T is distributed as a chi-square random variable
with (p-1) degrees of freedom. Thus, we have the following result.


Result:  Let $N(\mu_1, \Sigma)$ and $N(\mu_2, \Sigma)$ be two normal
densities. Given x, classify it to one of the two populations.
However, it may be possible that x belongs to another class, which
is neither one of the above two classes nor any other normal
population with mean lying on the straight line joining these means.
Then one can follow these steps:

Step 1:  First test the hypothesis $H_0$ vs. $H_1$ using T. If
T > C, where C is obtained by using the chi-square property of T,
then we conclude that x does not belong to either of the two
specified classes.

Step 2:  If T < C, then use the usual Fisher linear classifier to
classify x to one of the two specified classes.

Example:  Suppose $\mu_1 = (2, 6)$, $\mu_2 = (4, 9)$ with common
covariance matrix $\Sigma$ = [...]. Then a given observation
x = (13, 18) should be classified to one of these two populations
only if there is evidence that it does not belong to any other
class. For the above problem T = [...].

With 1 d.f. the critical value of a chi-square random variable is
3.841 at level of significance 0.05 and 7.879 at level of
significance 0.005. Hence, one would conclude that this observation
does not belong to either of the two specified classes.
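
The two-step procedure can be sketched in a few lines for p = 2. The
means and the observation are those of the example, but the
covariance matrix below is an assumption (the identity matrix),
because the matrix of the example is not legible here; the numerical
value of T obtained with it therefore need not agree with the
example. The function names are hypothetical.

```python
# Sketch of the two-step procedure: test H0 with T, then apply the Fisher
# linear classifier. The covariance matrix is an ASSUMPTION (identity);
# it is not the matrix of the example in the text.

def solve_2x2(sigma, v):
    """Return Sigma^{-1} v for a 2x2 matrix Sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    return ((d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def sub(u, v):
    return (u[0] - v[0], u[1] - v[1])

def t_statistic(x, mu1, mu2, sigma):
    """Squared distance of x from the line through mu1 and mu2, in the Sigma metric."""
    d = sub(x, mu1)
    m = sub(mu2, mu1)
    along_line = dot(d, solve_2x2(sigma, m)) ** 2 / dot(m, solve_2x2(sigma, m))
    return dot(d, solve_2x2(sigma, d)) - along_line

def two_step_classify(x, mu1, mu2, sigma, critical_value):
    """Return None if H0 is rejected, otherwise the Fisher class label 1 or 2."""
    if t_statistic(x, mu1, mu2, sigma) > critical_value:
        return None
    w = solve_2x2(sigma, sub(mu1, mu2))
    midpoint = ((mu1[0] + mu2[0]) / 2, (mu1[1] + mu2[1]) / 2)
    return 1 if dot(sub(x, midpoint), w) >= 0 else 2

MU1, MU2 = (2.0, 6.0), (4.0, 9.0)
SIGMA_ASSUMED = ((1.0, 0.0), (0.0, 1.0))   # assumption, see lead-in
X = (13.0, 18.0)

t_value = t_statistic(X, MU1, MU2, SIGMA_ASSUMED)
decision_05 = two_step_classify(X, MU1, MU2, SIGMA_ASSUMED, 3.841)
decision_005 = two_step_classify(X, MU1, MU2, SIGMA_ASSUMED, 7.879)
print(t_value, decision_05, decision_005)
```

With the assumed identity covariance, T equals 81/13 (about 6.23),
so the observation is rejected at the 0.05 level but not at the
0.005 level. T is zero for any observation lying exactly on the line
joining the two means, whatever its position along that line.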

It is worthwhile to note that the above procedure is developed for
the case when the parameters $\mu_1$, $\mu_2$ and $\Sigma$ are all
known. If they are unknown and the training set is large, then the
above procedure applies in the asymptotic sense. For a small
training set a satisfactory procedure has not been developed.

In a recent publication Lin (1978) suggests a variation of the idea
used in the Neyman-Pearson theory of hypothesis testing. To
understand his approach, consider the problem of testing a simple
hypothesis $H_0$ versus a simple hypothesis $H_1$, where

    $H_0$:  $f(x) = f_0(x)$

and

    $H_1$:  $f(x) = f_1(x)$

and f(x) denotes the probability density function of the random
variable x. According to the Neyman-Pearson theory, a test for the
above problem is obtained by minimizing the probability
P[x is classified as having the pdf $f_0(x)$ | true pdf is $f_1(x)$]
while keeping the probability P[x is classified as having the pdf
$f_1(x)$ | true pdf is $f_0(x)$] fixed. Equivalently, if the entire
sample space S is partitioned into two sets R and $R^c = S - R$, and
x is classified as having the pdf $f_0(x)$ if $x \in R$, then
according to the Neyman-Pearson theory R is chosen such that

    $\alpha = \int_{R^c} f_0(x)\, dx$

is fixed, and

    $\beta = \int_{R} f_1(x)\, dx$

is minimized.

In the absence of knowledge of $f_1(x)$, Lin suggests the following
test: choose R such that

    $\alpha = \int_{R^c} f_0(x)\, dx$

is fixed and

    $V(R) = \int_{R} dx$

is minimized.
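
The effect of minimizing V(R) for fixed $\alpha$ can be illustrated
in one dimension. The sketch below assumes that $f_0$ is the standard
normal density and considers only intervals; it is an illustration of
the criterion, not Lin's own computation. It uses NormalDist from the
Python standard library.

```python
from statistics import NormalDist

# Sketch: for f0 = standard normal (an assumption), compare the length V(R)
# of two intervals R that both leave alpha = 0.05 of the mass of f0 outside.

f0 = NormalDist(mu=0.0, sigma=1.0)
alpha = 0.05

def interval_with_lower_tail(lower_tail):
    """Interval [a, b] with `lower_tail` mass below a and alpha - lower_tail above b."""
    a = f0.inv_cdf(lower_tail)
    b = f0.inv_cdf(1.0 - (alpha - lower_tail))
    return a, b

def length(interval):
    return interval[1] - interval[0]

central = interval_with_lower_tail(alpha / 2)   # equal tails
shifted = interval_with_lower_tail(alpha / 5)   # unequal tails, same alpha

print(central, length(central))
print(shifted, length(shifted))
```

Both intervals satisfy the constraint on $\alpha$, but the
equal-tailed one is shorter. For a symmetric unimodal $f_0$ it is the
shortest interval with the required coverage, so it is the interval
that the criterion of minimizing V(R) would select.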

An adaptation of this concept to the classification problem is
suggested as follows. Suppose the problem is to classify x to one of
two classes with respective pdf's $h_1(x)$ and $h_2(x)$ and prior
probabilities $p_1$ and $p_2$ respectively. However, let there exist
a possibility that the object may not belong to either of these two
classes. In this situation Lin suggests that one can use a two-step
procedure.

Step (a).  Test the hypothesis

    $H_0$:  $f(x) = f_0(x) \equiv p_1 h_1(x) + p_2 h_2(x)$

vs.

    $H_1$:  density is unknown.

Step (b).  If in Step (a) the null hypothesis is accepted, then
apply the conventional classification rule.

Remark:  It can be easily seen that Lin's proposal, in the framework
of testing, is to test the null hypothesis

    $H_0$:  $f(x) = f_0(x)$

vs.

    $H_1$:  $f(x)$ = uniform over a certain unknown interval.

Thus, Lin is replacing the unknown density by the uniform density.


IN GAUSSIAN DISTRIBUTIONS


7.0  ABSTRACT: 

The purpose of this note is to show that it is not possible to
obtain the subset of k measurements, out of a set of n measurements,
which provides the best discrimination between two populations by
extending the set of (k-1) best measurements. This result is
demonstrated for the Gaussian distribution.

7.1  INTRODUCTION:

Cover and Van Campenhout ( ) considered the problem of selecting the
best k out of n measurements for the purpose of discriminating
between two populations. They showed that it is not possible to
extend the set of the best (k-1) measurements to obtain the set of
the best k measurements. In fact, they proved that there does not
exist any systematic method of obtaining the best subset of k
measurements for the purpose of discrimination. Only an exhaustive
search provides the desired answer.

In order to prove the above-mentioned property, Cover and
Van Campenhout first related n measurements of the distribution
under consideration to $n 2^{n-1}$ Gaussian random variables, then
established the result for these new Gaussian variables. This author
has not been able to follow their method of relating n variables of
some distribution to $n 2^{n-1}$ Gaussian random variables. Moreover,
it is not clear how one can obtain any desired ordering of the
subsets of n measurements, in terms of the probability of
misclassification, and also obtain specified magnitudes of these
error probabilities.


The purpose of this note is to show that any extension of
the best (k-1) measurements to a set of k measurements (k < n)
need not give us the set of best k measurements. This is demonstrated
in the case of Gaussian random variables and by means of
two simple examples. It is also demonstrated, by means of an example,
that although it may be possible to obtain a desired ordering
of subsets of n measurements (subject to a natural constraint
given below and also in Cover and Campenhout), it may not
be possible to obtain the desired magnitudes of the error probabilities.
This latter property is also obtained in the case of
Gaussian random variables.

Before proceeding further we wish to recall that this phenomenon,
of not being able to select the best subset of k out of n,
also occurs in the context of regression analysis [see Draper and
Smith (1966)].

7.2  TWO EXAMPLES TO SHOW THAT ANY EXTENSION OF THE "BEST" k TO "k+1"
MEASUREMENTS MAY NOT BE THE "BEST" SET OF (k+1) MEASUREMENTS:

Before  presenting  the  examples  we  wish  to  recall  two  basic 
results. 

Result (A): Let X be an n-dimensional normal random vector with
mean $\mu_i$ and covariance matrix $\Sigma = (\sigma_{ij})$, where i = 1 or 2 depending upon
whether X was drawn from population $\pi_1$ or $\pi_2$. Let the prior probabilities
of $\pi_1$ and $\pi_2$ be equal and the costs of misclassification
be equal. Then the minimum probability of misclassification is
given by $\Phi(-\Delta/2)$, where $\Phi$ denotes the distribution function of the
standard normal random variable and

    $\Delta^2 = (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2)$.

Result (B): Let A be a (p+q) x (p+q) positive definite symmetric matrix and
$\delta$ be a (p+q) dimensional column vector. Let A and $\delta$ be partitioned as follows:

    $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$  and  $\delta = \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix}$.

In terms of these partitioned matrices we have the following
well known equality:

    $\delta' A^{-1} \delta = \delta_1' A_{11}^{-1} \delta_1 + \delta_{2.1}' A_{22.1}^{-1} \delta_{2.1}$,

where $\delta_1$ and $\delta_2$ are p and q dimensional vectors, $A_{11}$, $A_{22}$
and $A_{12} = A_{21}'$ are p x p, q x q and p x q dimensional matrices, and

    $A_{22.1} = A_{22} - A_{21} A_{11}^{-1} A_{12}$,
    $\delta_{2.1} = \delta_2 - A_{21} A_{11}^{-1} \delta_1$.

In particular, if $A_{12}$ is a matrix of all zero elements, then
from result (B) the following equality is obtained:

    $\delta' A^{-1} \delta = \delta_1' A_{11}^{-1} \delta_1 + \delta_2' A_{22}^{-1} \delta_2$.

Let $\delta = \mu_1 - \mu_2 = (\delta_1, \ldots, \delta_n)'$. From result (A) it follows
that if one wishes to choose only one component out of n, then
the optimum choice is to select the ith component such that

    $\delta_i^2 / \sigma_i^2 = \max_j \, \delta_j^2 / \sigma_j^2$.
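Result (A) and the single-component rule can be checked numerically. The following is a minimal sketch in Python and is not part of the original note; the function names and the sample values of $\delta_i$ and $\sigma_i^2$ are illustrative assumptions.

```python
import math

def phi(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def min_error(delta_sq):
    """Minimum misclassification probability Phi(-Delta/2) of Result (A),
    for equal priors and equal costs; delta_sq is Delta squared."""
    return phi(-0.5 * math.sqrt(delta_sq))

def best_single(delta, var):
    """Index i (0-based) that maximises delta_i^2 / sigma_i^2."""
    scores = [d * d / v for d, v in zip(delta, var)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical values, chosen only for illustration.
delta = [2.0, 1.5, 1.0]
var = [1.0, 1.0, 1.0]
i = best_single(delta, var)
err = min_error(delta[i] ** 2 / var[i])
```

Here the first component is selected, and its error probability is $\Phi(-1)$, since $\Delta = 2$ for that component alone.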


Example 1: In this example we consider the special case n = 3. It
is observed that the best singleton set is $\{X_1\}$ but neither $\{X_1, X_2\}$
nor $\{X_1, X_3\}$ is the best set of two components.

Let X be a 3-dimensional normal random variable such that

* '  ^ . ‘r!'  “  “VT “ 


P  -  (Ptj) 


.96  1 


1  .96  ,0  »  /| 


(7.1) 


Clearly, the best singleton set of measurements is given by $\{X_1\}$,
because $(\delta_1^2/\sigma_1^2)$ is largest among all $(\delta_i^2/\sigma_i^2)$ for i = 1,2,3.

Next,  we  calculate 


]~" 


(ijfe  5)"(iJ-  W  4]- 

and  finally 

( ‘A'f'n  ’  /*»  \  f'-ojsT'  ~2d23  ‘2  <3  .  ,  ■ 

[h/ifu'n)  (s;  l  2?  [if  23—}  ;?_)**• 


a22  ff23 
°23  °33 


It is obvious that the set $\{X_2, X_3\}$ is the best set of two
components, and it is not an extension of $\{X_1\}$.
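The non-nesting behaviour of Example 1 can be reproduced by exhaustive search over subsets, using $\Delta^2$ from Result (A) as the criterion. The sketch below is only an illustration: the mean differences and the covariance matrix are hypothetical values chosen to show the effect, not the parameters of equation (7.1), and the helper names are assumptions.

```python
from itertools import combinations

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def dist_sq(delta, sigma, idx):
    """Delta^2 = delta' Sigma^{-1} delta restricted to the components in idx."""
    d = [delta[i] for i in idx]
    s = [[sigma[i][j] for j in idx] for i in idx]
    return sum(di * yi for di, yi in zip(d, solve(s, d)))

def best_subset(delta, sigma, k):
    """Exhaustive search for the k components with the largest Delta^2."""
    return max(combinations(range(len(delta)), k),
               key=lambda idx: dist_sq(delta, sigma, idx))

# Hypothetical parameters (assumed, not taken from the text).
delta = [2.0, 1.5, 1.0]
sigma = [[1.0, 0.0, 0.0],
         [0.0, 1.0, -0.96],
         [0.0, -0.96, 1.0]]
best1 = best_subset(delta, sigma, 1)   # 0-based index of the best singleton
best2 = best_subset(delta, sigma, 2)   # 0-based indices of the best pair
```

With these assumed values the best singleton is the first component, while the best pair consists of the second and third components, so the best pair is not an extension of the best singleton.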

Example 2: The above example is extended to arbitrary n. In
other words we show that there exists a normal distribution such
that the best set of k components, when extended to (k+1) components,
does not provide the best set of (k+1) components. Without
loss of generality let $\{X_1, \ldots, X_{k-1}\}$ be the best set of (k-1)
components and let n = k+2. Let $X = (X_1, \ldots, X_{k+2})'$ be normally
distributed with mean vectors $\mu_1$ and $\mu_2$ and common covariance
matrix $\Sigma$, given that it belongs to population $\pi_1$ and $\pi_2$ respectively.
Let $\delta = \mu_1 - \mu_2$ and $\Sigma$ be such that


A  *k+2>  ,  A  6k-l 

'o7  '  *  *  *'#  o7  ’  • 


2,  1.5,  1) 


'1  k+2  “1  “k-1 

and the correlation matrix, R, corresponding to $\Sigma$ satisfies

    $R = \begin{pmatrix} R_{11} & 0 \\ 0 & P \end{pmatrix}$

where P is given by (7.1) and $R_{11}$ is a (k-1) x (k-1) dimensional
matrix. From result (B), our assumption that the best set of (k-1)
components is $\{X_1, \ldots, X_{k-1}\}$, and Example 1 it is obvious that
the best set of k components is given by $\{X_1, \ldots, X_k\}$. But,
when we search for the best set of (k+1) components, it turns
out to be $\{X_1, \ldots, X_{k-1}, X_{k+1}, X_{k+2}\}$, which is not an extension
of $\{X_1, \ldots, X_k\}$.

Remark: The above result is of a negative nature. A more useful
result would be to discover conditions under which it would be possible
to obtain the best set of k components by extending the best
set of (k-1) components. This work is under investigation.

In the remaining part of this note we consider one other aspect
of Cover and Campenhout's result. This example appears to
contradict their basic theorem.

Suppose $M_1, M_2, \ldots, M_n$ are n measurements (scalar), $S_i$
denotes an arbitrary non-empty subset of these measurements and
$P_e(S_i)$ denotes the minimal probability of error when the elements
of $S_i$ are used for the purpose of classification (between two
classes). It is well known that if $S_i \subset S_j$ then $P_e(S_i) \ge P_e(S_j)$.
The probability of error $P_e(S_i)$ can be used to establish an ordering
among all $(2^n - 1)$ possible selections of measurements. Cover and
Campenhout state the following theorem.

Theorem: Given an arbitrary ordering on the subsets of measurements
$M_1, M_2, \ldots, M_n$, subject to the monotonicity constraints,
there exists a jointly normally distributed random vector X of n
dimensions which has exactly the same ordering and the same probability
of error.

Example 3: The following example shows that the above result is
not correct. Suppose $M_1, M_2, M_3$ are three measurements. The 7
possible non-empty measurement selections are ordered below
(subject to the monotonicity criterion mentioned above). The
corresponding error probabilities are also specified. Let the ordering
be $\{M_2\} \ge \{M_3\} \ge \{M_1\} \ge \{M_1,M_2\} \ge \{M_1,M_3\} \ge \{M_2,M_3\} \ge \{M_1,M_2,M_3\}$
and the corresponding error probabilities be .4, .38, .35, .3, .29,
.28 and .22 respectively. At this stage we are interested in the
following question.

Is it possible to find $X = (X_1, X_2, X_3)$ which is normally dis-

Although it may appear that there are 8 possible correlation matrices
such that the first six equalities in (7.2) are satisfied,
only one of these P matrices is positive definite. This matrix is
given below. The remaining seven matrices are not positive definite.

    $P = \begin{pmatrix} 1 & -.23966 & -.2135 \\ -.23966 & 1 & -.5392 \\ -.2135 & -.5392 & 1 \end{pmatrix}$

Given the above values of $\delta_i/\sigma_i$ and P, the remaining probabilities
of error are fixed. Consequently, the last equality in (7.2)
will not be satisfied. In this particular example

    $\Phi\{-\tfrac{1}{2}(\delta' \Sigma^{-1} \delta)^{1/2}\} = \Phi\{-\tfrac{1}{2}[(\delta/\sigma)' P^{-1} (\delta/\sigma)]^{1/2}\} = .1714$

which is different from the desired value of .22.
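The figure .1714 can be recomputed from the quantities stated in Example 3. The sketch below is an illustration only. It assumes that the singleton error probabilities .35, .40 and .38 belong to $M_1$, $M_2$ and $M_3$ respectively (the assignment that is consistent with the correlation matrix P printed above), and it uses Result (A) to convert each error probability e into a standardised mean difference $-2\Phi^{-1}(e)$. The helper names are assumptions.

```python
import math
from statistics import NormalDist

N01 = NormalDist()  # standard normal distribution

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def subset_error(d, corr, idx):
    """Phi(-Delta/2) for the measurements in idx (0-based indices)."""
    dd = [d[i] for i in idx]
    cc = [[corr[i][j] for j in idx] for i in idx]
    delta_sq = sum(x * y for x, y in zip(dd, solve(cc, dd)))
    return N01.cdf(-0.5 * math.sqrt(delta_sq))

# Assumed assignment of singleton errors to M1, M2, M3.
d = [-2.0 * N01.inv_cdf(e) for e in (0.35, 0.40, 0.38)]
P = [[1.0, -0.23966, -0.2135],
     [-0.23966, 1.0, -0.5392],
     [-0.2135, -0.5392, 1.0]]

pair_errors = [subset_error(d, P, idx) for idx in ((0, 1), (0, 2), (1, 2))]
triple_error = subset_error(d, P, (0, 1, 2))
```

Under these assumptions the pair errors come out close to .30, .29 and .28, while the error for all three measurements is forced to a value near .1714 rather than the requested .22.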

In general, suppose $M_1, M_2, \ldots, M_n$, $n \ge 3$, are n measurements.
A certain order among subsets of $\{M_1, M_2, \ldots, M_n\}$ and the associated
error probabilities are given. The order satisfies the
natural constraints. Using the error magnitudes corresponding to
the singletons $\{M_i\}$ we obtain the $(\delta_i/\sigma_i)$'s. Using the error magnitudes
corresponding to the $\{M_i, M_j\}$'s we get the $\rho_{ij}$'s. At this stage all
of the free parameters are fixed and consequently all of the other
error probabilities are also fixed. It is unlikely that a specified
order among the components will be satisfied.

Remark: Cover and Campenhout have generated $n 2^{n-1}$ Gaussian random
variables for the n original measurements. This gives them enough
freedom of selection to match the error probabilities. But their
process of arriving at n normal random variables from these $n 2^{n-1}$
variables is not at all clear. This is the major source of the disagreement
between their theorem and our counter example.

ON RULES TO COMBINE RESULTS OF SEVERAL DISCRIMINANTS


8.1.  INTRODUCTION 

In discriminant analysis, we try to allocate objects into one of the
given classes based on some measurements of the objects. There exist
instances where several independent attempts have been made to classify a
population of objects; each time a different set of measurements is chosen
and a new decision rule is constructed. Suppose we are presented with the
results of these decision rules, and we are asked to utilize these results
to classify the population of objects with a performance better than each
of the decision rules, whenever possible. In this paper we investigate such
methods for two class and three class problems. We confine ourselves to the case when
only three independent sets of measurements are taken. The results can be
generalized in a similar manner for other cases.

Let $X = (X_1, X_2, X_3)^t$ be a vector constructed by the juxtaposition of the
3 sets of measurements $X_i$ of an object. The dimension of each measurement
vector $X_i$ is $p_i$. Thus X is of dimension p, where $p = p_1 + p_2 + p_3$. Let
$X' = (D_1(X_1), D_2(X_2), D_3(X_3))$, where the $D_i$'s are the three discriminants given,
and $D_i(X_i)$ is the decision of the discriminant $D_i$ based on the set of measurements
$X_i$ of the object.

The purpose of this paper is to develop some schemes such that an object
represented by the vector X can be classified in one of the two classes $\omega_1$ and $\omega_2$.
In some cases, the set of measurements on an object is naturally grouped
in subsets such as the $X_i$'s, for instance if the classification was performed
on three different aspects of the same individual. Since the complexity in
constructing a decision rule for an object vector X increases with the
dimensionality p of X, it may be beneficial to develop a decision rule for each
of the subsets of measurements $X_i$, and combine the results of the different
decision rules in a sensible fashion to obtain a decision rule for X. This
avoids the complications caused by high dimensionality, yet all the features
of the object are considered in the classification. In other cases it may be
difficult to get the complete data at one place.

In Section 8.2 we consider the intuitively appealing majority logic and
in Section 8.3 the optimum method of combining the results of three discriminants
in the two class problem. Section 8.4 contains some general remarks on
the two class problem. In Section 8.5, it is demonstrated that there would
always be some loss in terms of probability of correct classification when the
three discriminants are combined as opposed to using the best possible discriminant
for the entire X. In Section 8.6 the case of the multivariate normal
is further investigated. We finish this paper with a quick review of these
ideas as they apply to the three class problem.

8.2.  THE MAJORITY RULE


Without loss of generality the function $D_i(X_i)$ can be defined as

    $D_i(X_i) = 1$  if $X_i$ is classified into $\omega_1$ by $D_i$,
    $D_i(X_i) = 0$  if $X_i$ is classified into $\omega_2$ by $D_i$.

Clearly, the $D_i$'s are independent of one another provided the $X_i$'s are
mutually independent. In the following derivations, we assume that the $X_i$'s,
for i = 1,2,3, are mutually independent.

A random vector X in the object space is thus mapped by the functions $D_i$
into the 3-dimensional binomial space S', where
$S' = \{(000)^t, (001)^t, (010)^t, (011)^t, (100)^t, (101)^t, (110)^t, (111)^t\}$.


In this section we consider the majority rule of combining the results
from the $D_i$'s, which belong to the sample space S'.

The majority rule is an intuitive approach to classify the samples in S'.
The rule is: whenever more than one of the $X_i$'s are classified by the $D_i$'s to
$\omega_1$, then X should be classified to $\omega_1$.

Let $D_m$ denote the majority decision function, i.e. let $D_m(X) = 1$ represent
that X is classified to $\omega_1$. Thus $D_m(X) = 1$ iff $\sum_{i=1}^{3} D_i(X_i) \ge 2$;
else $D_m(X) = 0$ and X is classified to $\omega_2$.

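Let $Q = \Pr(D_m(X) = 1 \mid \omega_1)$, the probability of correctly recognising an object
from $\omega_1$. Let $\alpha_i = \Pr(D_i(X_i) = 1 \mid \omega_1)$. We assume that $\alpha_i \ge 1/2$ for all i = 1, 2 and 3,
and that $\alpha_1 \le \alpha_2 \le \alpha_3$.

Theorem 1  i) $Q \ge \alpha_2 \ge \alpha_1$.  ii) $Q \ge \alpha_3$ is not necessarily satisfied.

Proof

i) It is sufficient to prove $Q \ge \alpha_2$.

We observe that

    $Q = Q(\alpha_1, \alpha_2, \alpha_3) = \alpha_1\alpha_2\alpha_3 + \alpha_1\alpha_2(1-\alpha_3) + \alpha_1(1-\alpha_2)\alpha_3 + (1-\alpha_1)\alpha_2\alpha_3$.

It is easily seen that Q is monotonically increasing in each
of the three variables $\alpha_1, \alpha_2, \alpha_3$.

Since $\alpha_3 \ge \alpha_2$,

    $Q \ge Q(\alpha_1, \alpha_2, \alpha_2)$
      $= \alpha_1\alpha_2^2 + \alpha_1\alpha_2(1-\alpha_2) + \alpha_1\alpha_2(1-\alpha_2) + (1-\alpha_1)\alpha_2^2$
      $= \alpha_2(2\alpha_1 - 2\alpha_1\alpha_2 + \alpha_2)$.

    $Q(\alpha_1, \alpha_2, \alpha_2) - \alpha_2 = \alpha_2(2\alpha_1 - 2\alpha_1\alpha_2 + \alpha_2 - 1)$
      $= \alpha_2(1-\alpha_2)(2\alpha_1 - 1)$.

Since $\alpha_1 \ge 1/2$, the above expression is never negative.

    $\therefore Q \ge \alpha_2$.

ii) $Q - \alpha_3 = \alpha_1\alpha_2\alpha_3 + \alpha_1\alpha_2(1-\alpha_3) + \alpha_1(1-\alpha_2)\alpha_3 + (1-\alpha_1)\alpha_2\alpha_3 - \alpha_3$
      $= \alpha_1\alpha_2 + \alpha_2\alpha_3 + \alpha_3\alpha_1 - 2\alpha_1\alpha_2\alpha_3 - \alpha_3$
      $= \alpha_1\alpha_2(1-\alpha_3) - (1-\alpha_1)(1-\alpha_2)\alpha_3$.

Therefore $Q \ge \alpha_3$ iff $\alpha_1\alpha_2(1-\alpha_3) - (1-\alpha_1)(1-\alpha_2)\alpha_3 \ge 0$, or
equivalently, $\alpha_1\alpha_2(1-\alpha_3) \ge (1-\alpha_1)(1-\alpha_2)\alpha_3$. That is,

    $Q \ge \alpha_3$  iff  $\dfrac{\alpha_1\alpha_2}{(1-\alpha_1)(1-\alpha_2)} \ge \dfrac{\alpha_3}{1-\alpha_3}$.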

The above theorem tells us that the intuitively appealing majority rule is
sometimes inferior in performance to the best of the $D_i$'s. It may be possible
to improve the performance by, instead of using the majority rule, using a linear
combination of the $D_i(X_i)$'s, $\sum_{i=1}^{3} C_i D_i(X_i)$, where the weight $C_i$ is chosen to reflect
the magnitude of the $\alpha_i$'s. However, we will not digress into this. Instead,
we will derive a nonlinear procedure which provides the best possible discriminant
based on the $D_i$'s.


8.3.  THE LIKELIHOOD RATIO DECISION RULE

Let $\alpha_i = \Pr(D_i(X_i) = 1 \mid \omega_1)$, $\beta_i = \Pr(D_i(X_i) = 0 \mid \omega_2)$, i = 1,2,3.
The $\alpha_i$'s and $\beta_i$'s are the conditional probabilities of correct recognition for
observations from $\omega_1$ and $\omega_2$ respectively; $1-\alpha_i$ and $1-\beta_i$ are the probabilities
of misclassification. Let

    $f = \prod_{i=1}^{3} \Pr(D_i(X_i) \mid \omega_1)$,   $g = \prod_{i=1}^{3} \Pr(D_i(X_i) \mid \omega_2)$.

The likelihood ratio decision rule $D_\ell$ is such that if f/g > 1 decide $X \in \omega_1$,
if f/g < 1 decide $X \in \omega_2$, and if f/g = 1 then decide $X \in \omega_1$ with probability .5.

Theorem 2  Given the decisions of the $D_i$'s, the decision rule $D_\ell$ is the
optimal decision rule.

Proof  Since the $D_i(X_i)$'s are mutually independent, the following equality
holds:

    $\prod_{i=1}^{3} \Pr(D_i \mid \omega_j) = \Pr(D_1, D_2, D_3 \mid \omega_j)$.

Therefore the decision rule $D_\ell$ is the same as the most powerful test for a
simple hypothesis $X \in \omega_1$ versus a simple alternative $X \in \omega_2$, given by the
Neyman-Pearson lemma. Consequently, with the obvious choices of the constants of the
Neyman-Pearson lemma the discriminant $D_\ell$ is optimum.

It remains to find the performance of the above optimal decision rule $D_\ell$.
Proposition 3 given below not only answers this question but also relates
the two decision rules $D_m$ and $D_3$ discussed in Theorem 1. In Proposition 3
we also assume $\alpha_i = \beta_i$, i = 1,2,3.

Proposition 3  Let $\alpha_i = \beta_i$, i = 1,2,3, and $\alpha_1 \le \alpha_2 \le \alpha_3$. The probability
of the correct decision by $D_\ell$ is Q*, where

    $Q^* = \alpha_1\alpha_2\alpha_3 + \alpha_1(1-\alpha_2)\alpha_3 + (1-\alpha_1)\alpha_2\alpha_3 + \max(\alpha_1\alpha_2(1-\alpha_3), (1-\alpha_1)(1-\alpha_2)\alpha_3)$
        = max(probability of correct decision by $D_m$, probability of
              correct decision by $D_3$).

Proof  By $D_\ell$ any observation $(D_1(X_1), D_2(X_2), D_3(X_3))$ will be classified as
belonging to $\omega_1$ ($\omega_2$) provided

    $\prod_{i=1}^{3} \alpha_i^{D_i(X_i)} (1-\alpha_i)^{1-D_i(X_i)} > (<) \prod_{i=1}^{3} (1-\alpha_i)^{D_i(X_i)} \alpha_i^{1-D_i(X_i)}$

or equivalently

    $\prod_{i=1}^{3} \left(\dfrac{\alpha_i}{1-\alpha_i}\right)^{2 D_i(X_i)} > (<) \prod_{i=1}^{3} \dfrac{\alpha_i}{1-\alpha_i}$

(in the case of equality, to $\omega_1$ with probability .5). Let $(D_1(X_1), D_2(X_2), D_3(X_3))$ take values in
{(1,1,1), (0,1,1), (1,0,1)}. Clearly the decision rule $D_\ell$ will
classify $X \in \omega_1$ because $\alpha_1(1-\alpha_1)^{-1} \le \alpha_2(1-\alpha_2)^{-1} \le \alpha_3(1-\alpha_3)^{-1}$. Similarly,
if $(D_1(X_1), D_2(X_2), D_3(X_3))$ takes values in {(0,0,0), (1,0,0), (0,1,0)}
then the decision rule will classify them as coming from $\omega_2$. The
only other cases are when $(D_1(X_1), D_2(X_2), D_3(X_3))$ = (1,1,0) or (0,0,1).

Obviously, we could classify (1,1,0) as coming from $\omega_1$ ($\omega_2$) if

    $\dfrac{\alpha_1\alpha_2(1-\alpha_3)}{(1-\alpha_1)(1-\alpha_2)\alpha_3} > (<) 1$

and we could classify (0,0,1) as coming from $\omega_1$ ($\omega_2$) provided

    $\dfrac{(1-\alpha_1)(1-\alpha_2)\alpha_3}{\alpha_1\alpha_2(1-\alpha_3)} > (<) 1$.

By symmetry of the above two cases one and only one of these two points will
result in the acceptance of $\omega_1$. Since the above two conditions are the same as
in the second part of Theorem 1, Q* is given by the first equation. To show
that the second equality holds, we observe that if

    $Q^* = \alpha_1\alpha_2\alpha_3 + \alpha_1(1-\alpha_2)\alpha_3 + (1-\alpha_1)\alpha_2\alpha_3 + (1-\alpha_1)(1-\alpha_2)\alpha_3$

then the right hand side simplifies to $\alpha_3$.

8.4  SOME REMARKS


Since both of the above suggested decision rules are based on the samples
in S' only ($S' = \{(000)^t, \ldots, (111)^t\}$), once the segments $X_i$ of an observation X are
classified by the $D_i$'s, we can use template matching (checking the content of
the 3-D vector X' against all the possible elements in S') to decide the classification
of X. Thus, classification using either the majority or the optimal decision
rule involves the same amount of work.
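The template matching described above amounts to a lookup table over the eight elements of S'. The sketch below builds such a table from the likelihood ratio rule and evaluates its probability of correct decision for class $\omega_1$, under the assumption $\alpha_i = \beta_i$ used in Proposition 3. It is an illustration only; the function names and the sample accuracies are assumptions.

```python
from itertools import product

def likelihood(bits, acc, from_class1):
    """Probability of the decision vector `bits` given the true class,
    for independent discriminants with accuracies acc[i] = alpha_i = beta_i."""
    p = 1.0
    for d, a in zip(bits, acc):
        correct = (d == 1) if from_class1 else (d == 0)
        p *= a if correct else 1.0 - a
    return p

def lr_table(acc):
    """Map each element of S' to 1 (class 1), 0 (class 2) or 0.5 (tie)."""
    table = {}
    for bits in product((0, 1), repeat=len(acc)):
        f = likelihood(bits, acc, True)
        g = likelihood(bits, acc, False)
        table[bits] = 1.0 if f > g else (0.0 if f < g else 0.5)
    return table

def lr_accuracy(acc):
    """Q*: probability that an object from class 1 is assigned to class 1."""
    table = lr_table(acc)
    return sum(table[b] * likelihood(b, acc, True) for b in table)

acc = (0.6, 0.6, 0.95)          # hypothetical accuracies
table = lr_table(acc)
q_star = lr_accuracy(acc)
```

For these assumed accuracies the table sends (0,0,1) to class 1 and (1,1,0) to class 2, so the rule coincides with the best single discriminant and Q* equals its accuracy. For three equal accuracies the same code reproduces the majority rule.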

For the majority rule to work properly, the number of subgroups k of the
elements of X should be an odd number. For the likelihood ratio decision rule,
there is no such restriction on the number of subgroups of the elements of X,
and we can be sure that this rule is always the best based on the available
information on elements of S'.

8.5  COMPARISON BETWEEN $D_\ell$ AND THE BEST DECISION RULE USING THE COMPLETE OBSERVATION X


Earlier in the introduction it was noted that if p, the dimension of $X = (X_1, X_2, X_3)$,
is very large we may prefer to construct first the $D_1, D_2, D_3$ and then use
their results to form $D_\ell$. The discriminants $D_1$, $D_2$ and $D_3$ along with $D_\ell$ are
assumed to be optimum and the $D_i$'s are independent. Suppose, on the other hand,
we construct the optimum discriminant D using the whole set X. Then a natural
question is: Are D and $D_\ell$ different from each other?

There is no doubt that D, being optimum, must perform at least as well as
$D_\ell$ in terms of probability of correct decision. Thus all we need to verify is
whether $D_\ell$ is inferior in its performance. In the following discussion we have
tried to answer this question under some assumptions which are reasonable for
two class problems.

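Suppose the a priori probabilities of the two classes are the same. Let
$f_j^{(k)}(X_j)$ be the p.d.f. of $X_j$ when $\omega_k$ is the true class, k = 1,2; j = 1,2,3. Then,
by the rule of optimum decision (we assume all of the parameters are known),

    $D_j(X_j) = 1$, i.e. $X_j$ is classified as from $\omega_1$, if $f_j^{(1)}(X_j) / f_j^{(2)}(X_j) > 1$,
    $D_j(X_j) = 0$, i.e. $X_j$ is classified as from $\omega_2$, otherwise;

    $D(X) = 1$, i.e. X is classified as from $\omega_1$, if $\prod_{j=1}^{3} f_j^{(1)}(X_j) / \prod_{j=1}^{3} f_j^{(2)}(X_j) > 1$,
    $D(X) = 0$, i.e. X is classified as from $\omega_2$, if $\prod_{j=1}^{3} f_j^{(1)}(X_j) / \prod_{j=1}^{3} f_j^{(2)}(X_j) < 1$

(for the sake of simplicity we assume that $\prod_j f_j^{(1)}(X_j) / \prod_j f_j^{(2)}(X_j) = 1$ with probability zero).

First we consider the case where

(a) $\alpha_i = \Pr(D_i(X_i) = 1 \mid \omega_1) = \Pr(D_i(X_i) = 0 \mid \omega_2)$,

(b) $1/2 \le \alpha_1 \le \alpha_2 \le \alpha_3$ such that $\alpha_1\alpha_2(1-\alpha_3) \ge (1-\alpha_1)(1-\alpha_2)\alpha_3$,

so that $D_\ell$ is the majority rule, i.e. the 2 out of 3 rule.
Under the above assumptions,

    $\Pr(D(X) = 1 \mid \omega_1) = \Pr\left[\prod_{j=1}^{3} f_j^{(1)}(X_j) / f_j^{(2)}(X_j) > 1 \mid \omega_1\right] = \int_A \prod_{j=1}^{3} f_j^{(1)}(X_j)\, dX_j$

where $A = \{X : \prod_{j=1}^{3} f_j^{(1)}(X_j) / f_j^{(2)}(X_j) > 1\}$.

Let $\beta_j(X_j) = f_j^{(1)}(X_j) / f_j^{(2)}(X_j)$, j = 1,2,3. Then

    $A = \{X : \beta_1\beta_2\beta_3 > 1\}$.

Clearly, the set A contains X such that at least one of the three $\beta_j$'s is larger
than 1 and, of course, $\beta_1\beta_2\beta_3 > 1$. We can easily see that A can be written in
terms of the union of 7 disjoint sets $A_1$ through $A_7$ where

    $A_1 = \{X : \beta_1 > 1, \beta_2 > 1, \beta_3 > 1\}$,

    $A_2 = \{X : \beta_1 > 1, \beta_2 > 1, \beta_3 < 1, \beta_1\beta_2 > \beta_3^{-1}\}$; $A_3$ and $A_4$ are defined similarly
    with $\beta_3$ replaced by $\beta_2$ and $\beta_1$ respectively,

    $A_5 = \{X : \beta_1 > 1, \beta_2 < 1, \beta_3 < 1$ such that $\beta_1 > (\beta_2\beta_3)^{-1}\}$,

and $A_6$ and $A_7$ are defined in a similar manner. Thus

    $\Pr(D(X) = 1 \mid \omega_1) = \int_{\bigcup_{j=1}^{7} A_j} \prod_{k=1}^{3} f_k^{(1)}(X_k)\, dX_k$        (8.1)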
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In  a  Banner  similar  to  the  above  it  can  be  shown  that 


Fr (0(g)  -  0(»a)  -  Pr(Dt(£)  -  0jw2) 

(2) 


'4 

UB 


-  /. 


j 


7 

UA 


Jl  'B'  -  -1-2' 

iv‘'<v^-v*2 

k*l 


il 


2  J  5 

By  assumption  the  above  two  differences  0^-02  and 


are  equal.  But 


fll  "  ri  “  fktl>  <xk,<S*k  >  !1  *  \  }  (Xk)dX* 

Ua  **  Oh  18-1 

5  1  5  j 


“  ♦i 


and 


•j-^4 


j,*!."’  <*k,4*»  *  '<  Ja"’  <**>*>(» 

UB.  UB,  « 

9  3  *  3 


“  ♦, 


.*.  >  o  and  e2-4L  >  o. 


How,  we  observe  that  6^  B  6^  because  D  is  better  than  0£.  If  0^  -  0^  then 
4^  »  4^  and  on  the  otherhand 

♦i  »  *a  *  *i  *  ♦» 

would  iaply  >  ♦j  a  contradiction .  Thus  we  oust  have  6^  >  02-  This  iaplies 
that  D  is  always  better  than  0^  except  in  the  case  when  probabilities  of  the 
sets  Aj  through  Kj  and  B2  to  >4  are  aero  under  w^  as  well  as  e2. 

In the remaining part of this section we consider the special case of the
normal population, which plays a significant role in discriminant analysis.
Consider, once again, the simplest form of the discrimination problem. Let
$X_i$ follow a multivariate normal distribution with mean vector $\mu_i^{(j)}$ and covariance
matrix $\Sigma_i$, i = 1,2,3 and j = 1,2, associated with the two classes. We assume that the $X_i$'s
are all stochastically independent. Under the above assumptions, the probabilities
of correct classification are given by $\alpha_i = \Phi(d_i/2)$, i = 1,2,3, when the optimum
discriminants $D_i(X_i)$ are used, i = 1,2,3. Here

    $D_i(X_i) = 1$ if $[X_i - \tfrac{1}{2}(\mu_i^{(1)} + \mu_i^{(2)})]' \Sigma_i^{-1} [\mu_i^{(1)} - \mu_i^{(2)}] > 0$,
    $D_i(X_i) = 0$ if $[X_i - \tfrac{1}{2}(\mu_i^{(1)} + \mu_i^{(2)})]' \Sigma_i^{-1} [\mu_i^{(1)} - \mu_i^{(2)}] < 0$,

and $d_i^2 = [\mu_i^{(1)} - \mu_i^{(2)}]' \Sigma_i^{-1} [\mu_i^{(1)} - \mu_i^{(2)}]$.
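On the other hand, if the whole vector $X = (X_1, X_2, X_3)$ is used then the
optimum discriminant rule D(X) is given by

    $D(X) = 1$ if $[X - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})]' \Sigma^{-1} [\mu^{(1)} - \mu^{(2)}] > 0$,
    $D(X) = 0$ if $[X - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})]' \Sigma^{-1} [\mu^{(1)} - \mu^{(2)}] < 0$,

where

    $\Sigma = \begin{pmatrix} \Sigma_1 & 0 & 0 \\ 0 & \Sigma_2 & 0 \\ 0 & 0 & \Sigma_3 \end{pmatrix}$.

The probability of correct classification by this rule is $\Phi(d/2)$, where
$d^2 = d_1^2 + d_2^2 + d_3^2$. Thus, if $d_i = 1$ for all i, then the probability of correct
classification by this rule is $\Phi(\sqrt{3}/2) = .806$, whereas the probability of correct
decision by the majority rule is given by

    $\Phi^3(.5) + 3\,\Phi^2(.5)\,[1 - \Phi(.5)] = .7726$.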
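The comparison for $d_i = 1$ can be recomputed directly. The following Python sketch is an illustration under the assumptions of this section (independent normal subsets, known parameters, equal priors); the variable names are assumptions.

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal distribution function

d = [1.0, 1.0, 1.0]     # distances d_i for the three subsets

# Optimum discriminant on the whole vector: d^2 = d_1^2 + d_2^2 + d_3^2.
full = Phi(0.5 * math.sqrt(sum(x * x for x in d)))

# Majority rule on the three optimum sub-discriminants, alpha_i = Phi(d_i / 2).
a1, a2, a3 = (Phi(0.5 * x) for x in d)
majority = (a1 * a2 * a3 + a1 * a2 * (1 - a3)
            + a1 * (1 - a2) * a3 + (1 - a1) * a2 * a3)

loss = full - majority  # price paid for combining decisions only
```

The two probabilities agree with the values quoted in the text to within rounding, and the loss from combining decisions instead of using the whole vector is a little over three percentage points.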


8.6.  NORMAL DISTRIBUTION WITH UNKNOWN MEAN VECTORS, $\Sigma$ KNOWN

In this section we evaluate the difference between the probabilities of
correct classification by the two methods discussed in this paper. We consider
the case when the class conditional distributions are normal with common, known
covariance matrix $\Sigma$ and unknown mean vectors. That is, in the notation of
the previous section, for j = 1,2, $X_i$ follows a multivariate normal distribution
with mean vector $\mu_i^{(j)}$, which is unknown, and covariance matrix $\Sigma_i$, assumed to
be known, i = 1,2,3. Clearly $X = (X_1', X_2', X_3')'$ also follows a multivariate normal
distribution with mean vector $\mu^{(j)} = (\mu_1^{(j)}, \mu_2^{(j)}, \mu_3^{(j)})$ and covariance matrix
$\Sigma = \mathrm{diag}(\Sigma_1, \Sigma_2, \Sigma_3)$.

Obviously independence of $X_1$, $X_2$ and $X_3$ is implied by this covariance structure.
Under these assumptions, the discriminant D(Y) is used, where

    $D(Y) = 1$, i.e. Y is classified to class 1, if $[Y - \tfrac{1}{2}(\bar{X}^{(1)} + \bar{X}^{(2)})]' \Sigma^{-1} (\bar{X}^{(1)} - \bar{X}^{(2)}) > 0$,
    $D(Y) = 0$, i.e. Y is classified to class 2, otherwise,

where $\bar{X}^{(j)}$ = sample mean of $n_j$ observations from the jth class, j = 1,2, and Y
is an observation to be classified in one of the two classes. A similar expression
for the discriminant $D_i(Y_i)$ will hold if the ith subset is used for this purpose,
i = 1,2,3.
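Clearly, the probability of correct decision, when D(Y) is employed, is given
by

    Pr[correct decision by D(Y) | $\omega_1$]
      $= \Pr\{[Y - \tfrac{1}{2}(\bar{X}^{(1)} + \bar{X}^{(2)})]' \Sigma^{-1} (\bar{X}^{(1)} - \bar{X}^{(2)}) > 0 \mid \bar{X}^{(1)}, \bar{X}^{(2)}, \omega_1\}$
      $= \Phi[a(\bar{X}^{(1)}, \bar{X}^{(2)}, \mu^{(1)})]$,

where

    $a(\bar{X}^{(1)}, \bar{X}^{(2)}, \mu^{(1)}) = \dfrac{[\mu^{(1)} - \tfrac{1}{2}(\bar{X}^{(1)} + \bar{X}^{(2)})]' \Sigma^{-1} (\bar{X}^{(1)} - \bar{X}^{(2)})}{[(\bar{X}^{(1)} - \bar{X}^{(2)})' \Sigma^{-1} (\bar{X}^{(1)} - \bar{X}^{(2)})]^{1/2}}$

and $\Phi(\cdot)$ denotes the distribution function of the standard normal random variable.
In a similar manner, given $\bar{X}_i^{(1)}$ and $\bar{X}_i^{(2)}$, the conditional probability of correct
classification by $D_i(Y_i)$ will be given by $\Phi[a_i(\bar{X}_i^{(1)}, \bar{X}_i^{(2)}, \mu_i^{(1)})]$, where $a_i$ is also
defined with appropriate changes. Denote $\Phi[a_i(\bar{X}_i^{(1)}, \bar{X}_i^{(2)}, \mu_i^{(1)})]$ by $\phi_i$.

Conditional on the event that $\bar{X}^{(1)}$ and $\bar{X}^{(2)}$ are given, the probability
of correct classification after combining the results of $D_1$, $D_2$ and $D_3$ is given
by

    $\phi_1$  if $\bar{X} \in A_1 = \{\bar{X} : \phi_1(1-\phi_2)(1-\phi_3) > (1-\phi_1)\phi_2\phi_3\}$,
    $\phi_2$  if $\bar{X} \in A_2 = \{\bar{X} : (1-\phi_1)\phi_2(1-\phi_3) > \phi_1(1-\phi_2)\phi_3\}$,
    $\phi_3$  if $\bar{X} \in A_3 = \{\bar{X} : (1-\phi_1)(1-\phi_2)\phi_3 > \phi_1\phi_2(1-\phi_3)\}$,
    $\phi_1\phi_2\phi_3 + \phi_1\phi_2(1-\phi_3) + \phi_1(1-\phi_2)\phi_3 + (1-\phi_1)\phi_2\phi_3$  otherwise.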


Thus the unconditional probability of correct classification is given by the
integral of the above probabilities over $A_1$, $A_2$, $A_3$ and the remaining region
with respect to the joint density of $\bar{X}^{(1)}$, $\bar{X}^{(2)}$. This being a difficult problem
of integration, we obtain a lower bound by integrating the last expression over
the entire range. Thus, the probability of correct classification, when the results
of $D_1$, $D_2$ and $D_3$ are employed, is greater than

    $E\{\phi_1\phi_2\phi_3 + \phi_1\phi_2(1-\phi_3) + \phi_1(1-\phi_2)\phi_3 + (1-\phi_1)\phi_2\phi_3\}$.

Using the independence of the $X_i$'s the above reduces considerably because, for
instance, we can replace $E\{\phi_1\phi_2\phi_3\}$ by $E\{\phi_1\} E\{\phi_2\} E\{\phi_3\}$.
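The lower bound can be illustrated by simulation. The sketch below is not the computation used in the text, which relies on a closed form expression from John (1961). It assumes each subset is univariate with unit variance and class means $\pm d/2$, estimates $E\{\phi_i\}$ by Monte Carlo over the two sample means, and then uses independence to evaluate the bound. All names, the choice d = 1 and the training size are assumptions.

```python
import math
import random
from statistics import NormalDist

Phi = NormalDist().cdf

def expected_phi(d, n_train, trials=20000, seed=0):
    """Monte Carlo estimate of E{phi} for one univariate subset with unit
    variance and class means +d/2 (class 1) and -d/2 (class 2)."""
    rng = random.Random(seed)
    mu1, mu2 = 0.5 * d, -0.5 * d
    s = math.sqrt(1.0 / n_train)      # std. deviation of a sample mean
    total = 0.0
    for _ in range(trials):
        m1 = rng.gauss(mu1, s)
        m2 = rng.gauss(mu2, s)
        diff = m1 - m2
        if diff == 0.0:
            total += 0.5
            continue
        # a = (mu1 - midpoint) * diff / sqrt(diff^2), the univariate form of a(.)
        a = (mu1 - 0.5 * (m1 + m2)) * diff / abs(diff)
        total += Phi(a)
    return total / trials

def majority_expression(e1, e2, e3):
    """Expectation of the 2-out-of-3 expression, by independence of the phi_i."""
    return (e1 * e2 * e3 + e1 * e2 * (1 - e3)
            + e1 * (1 - e2) * e3 + (1 - e1) * e2 * e3)

e = expected_phi(1.0, n_train=5)
lower_bound = majority_expression(e, e, e)
known_case = majority_expression(Phi(0.5), Phi(0.5), Phi(0.5))
```

Because the expression is linear in each $\phi_i$ separately, its expectation depends only on the individual $E\{\phi_i\}$, which is what makes the bound easy to evaluate. With estimated means each $E\{\phi_i\}$ falls below $\Phi(d_i/2)$, so the bound falls below the known-parameter value of the majority rule.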

The expression for $E[1-\phi]$ is given by Equation 77 of John (1961). The
$E[1-\phi_i]$ can also be obtained similarly. Thus an upper bound for the difference
in the probabilities of correct classification can be evaluated. The following
table gives these upper bounds for a few choices of the number of observations in
the training sample. The number of training samples, N, is equal in both
classes. In the table $d_i^2$, i = 1,2,3, denotes the Mahalanobis distance between
the two populations, measured for the ith subset only.


From the table it is clear that the difference increases as N increases
provided other parameters are fixed. For all other parameters fixed, the
difference increases with $d^2$. The most significant observation from the table
is that if the actual probability of correct classification is large then the
difference is also large. One also concludes that the actual difference in
probability of correct decision between using the linear discriminant with the
whole set of observations and in parts decreases if the parameters of the population
are unknown. Thus, in other words, one would be less concerned about the
loss in using the alternative method of discrimination when the parameters
are unknown.

8.7.  EXTENSION TO THE THREE CLASS PROBLEM

In the previous sections we have considered the case when there are only
two classes. In this section the ideas of the previous sections are extended
to the three class problem. Extension to more than three classes will be straightforward.
It will be seen that, due to a large number of parameters, there are
some difficulties in getting results in the most general form, but the basic
concepts remain unchanged. To formalise the concepts we denote the three classes
by $\omega_1$, $\omega_2$ and $\omega_3$. As before, we confine our attention to the case when there are
three independent subsets of measurements on each object. Furthermore, as in
the previous section, we have the results of the three discriminants operating
on each subset. Our aim is to combine these results to decide to which class the
given object belongs.

The sample space of the results of the three discriminants is given by

    $S' = \{(i_1, i_2, i_3) : i_1, i_2, i_3 = 1, 2, 3\}$

where the triplet $(i_1, i_2, i_3)$ means that the first discriminant, using the first
segment of the measurement on the given object, classifies it to class $i_1$, the
second discriminant, using the second segment, to class $i_2$, and the third
discriminant, using the last segment, to class $i_3$. Our object, as pointed
out earlier also, is to combine the result $(i_1, i_2, i_3)$ and classify the object to
one of the three classes. Given the sample space S' and associated probability
measures, the optimum criterion of classification would use the Bayesian approach.

Under  the  assumption  of  equally  probable  classes  and  equal  costs  the  Bayes  approach 

would  amply  the  likelihood  functions  only,  but  the  unoqual  costs  and  unequal 

apriori  probabilities  can  be  accommodated  using  the  standard  procedures,  Anderson  (1968). 

The probability structure associated with the sample space $S'$ is given by
the following 27 parameters

$P_{i,j}^{(k)}$ = probability that the discriminant $k$ will classify $X$ into class $i$
when it actually comes from class $j$,

for $1 \le i, j, k \le 3$. Obviously these parameters are not all independent because

$$\sum_{i=1}^{3} P_{i,j}^{(k)} = 1 \quad \text{for each } j \text{ and } k.$$


Furthermore, $P_{j,j}^{(k)}$ denotes the probability of correct classification for each
$j$ and $k$, whereas $P_{i,j}^{(k)}$ denotes the probability of misclassification for $i \ne j$
and each $k$. Given these parameters one can easily calculate the probability of
observing any sample point of $S'$ conditional on $X$ belonging to a specified class.
For instance, because the three subsets of measurements are independent, the
probability that the result of the three discriminants will be
$(i_1, i_2, i_3)$ when $X$ comes from class 1 is given by

$$P_1(i_1, i_2, i_3) = P_{i_1,1}^{(1)} \, P_{i_2,1}^{(2)} \, P_{i_3,1}^{(3)}.$$


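The optimum discriminant method which is based upon the elements of $S'$ is then given
by $D(X)$, where $D(X) = i$ means that $X$ is classified to class $i$.
This discriminant (see Anderson [1]) is given by

$$D(X) = i \quad \text{iff} \quad P_i(i_1, i_2, i_3) > P_j(i_1, i_2, i_3) \quad \text{for all } j \ne i,$$

where $P_j(i_1, i_2, i_3) = P_{i_1,j}^{(1)} P_{i_2,j}^{(2)} P_{i_3,j}^{(3)}$ is the probability of observing
$(i_1, i_2, i_3)$ when $X$ comes from class $j$. Based on the available information only,
$D$ will be optimum, which can be seen from the arguments given in Anderson [1].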
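The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the array layout `P[k][i][j]` (0-based indices), the helper names `likelihood`, `fuse` and `confusion`, and the numerical error rates are all hypothetical and are not taken from the text. Ties are returned as a set so that an arbitrary tie-break can be applied afterwards.

```python
from itertools import product

def likelihood(P, outcome, j):
    """Probability of observing outcome = (i1, i2, i3) when X comes from class j.

    P[k][i][j] is the probability that discriminant k assigns class i to an
    object from class j (all indices 0-based here). Independence of the three
    measurement segments lets us multiply the per-discriminant terms.
    """
    prob = 1.0
    for k, i in enumerate(outcome):
        prob *= P[k][i][j]
    return prob

def fuse(P, outcome):
    """Return the set of classes that maximise the likelihood of `outcome`.

    With equal priors and equal costs this is the Bayes decision; a set with
    more than one element signals a tie that must be broken arbitrarily.
    """
    n_classes = len(P[0])
    scores = [likelihood(P, outcome, j) for j in range(n_classes)]
    best = max(scores)
    return {j for j, s in enumerate(scores) if abs(s - best) < 1e-12}

def confusion(p, n_classes=3):
    """Hypothetical helper: confusion matrix with equal error probability p."""
    return [[(1 - (n_classes - 1) * p) if i == j else p
             for j in range(n_classes)] for i in range(n_classes)]

# Illustrative error rates (assumptions, not from the source).
P = [confusion(0.30), confusion(0.20), confusion(0.05)]

# Each column of each confusion matrix must sum to one.
for Pk in P:
    for j in range(3):
        assert abs(sum(Pk[i][j] for i in range(3)) - 1.0) < 1e-12

# Print the decision for every point of the sample space, using 1-based labels.
for outcome in product(range(3), repeat=3):
    labels = tuple(i + 1 for i in outcome)
    print(labels, "->", sorted(j + 1 for j in fuse(P, outcome)))
```

Unequal priors or costs could be accommodated by weighting each score before taking the maximum; that variation is not shown here.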

In the remainder of this section we study the above discriminant when $P_{i,j}^{(k)}$ satisfies
additional conditions.


Special Case 1:


Suppose $P_{i,j}^{(k)}$ are such that all of the probabilities of misclassification
are equal for each class and each discriminant. Denote this probability by $p$.
That is

$$P_{i,j}^{(k)} = \begin{cases} p & \text{if } i \ne j, \\ 1 - 2p & \text{if } i = j. \end{cases}$$


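It is easily seen that, for $p < 1/3$, $D$ is equivalent to the majority rule except
that the 6 permutations of (1,2,3) are classified arbitrarily with equal
probabilities. Further, the probability of correct classification will be
given by

$$(1-2p)^3 + 6(1-2p)^2 p + 2(1-2p)p^2,$$

where the three terms correspond to a unanimous correct vote, a two-to-one vote
for the true class (three choices of the dissenting discriminant, each erring in
one of two ways), and a three-way split resolved in favor of the true class with
probability 1/3.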
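The probability of correct classification in this equal-error case can be checked by brute-force enumeration of the 27 sample points. The sketch below assumes that $X$ comes from class 1, that $p < 1/3$, and that a three-way split is resolved uniformly at random. The function names are hypothetical, and the closed form it is compared with is obtained by the same counting: one unanimous point, six two-to-one points, and six split points each worth one third.

```python
from itertools import product

def correct_probability_case1(p):
    """P(correct) for the Bayes fusion rule when every error probability is p.

    X is taken to come from class 1. For p < 1/3 the rule is majority vote;
    a three-way split is resolved by picking a class uniformly at random,
    so such an outcome contributes one third of its probability.
    """
    q = 1.0 - 2.0 * p  # probability that a single discriminant is correct
    total = 0.0
    for outcome in product((1, 2, 3), repeat=3):
        prob = 1.0
        for label in outcome:
            prob *= q if label == 1 else p
        if outcome.count(1) >= 2:
            total += prob            # clear majority for the true class
        elif len(set(outcome)) == 3:
            total += prob / 3.0      # split vote, arbitrary tie-break
    return total

def closed_form(p):
    """Closed form obtained by counting the contributing sample points."""
    q = 1.0 - 2.0 * p
    return q**3 + 6 * q**2 * p + 2 * q * p**2

for p in (0.05, 0.1, 0.2, 0.3):
    print(p, correct_probability_case1(p), closed_form(p))
```

At $p = 1/3$ each discriminant is no better than chance, and the closed form then equals 1/3, which is a useful sanity check.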


Case  2 

Suppose $P_{i,j}^{(k)}$ satisfies the following conditions. Let $P_{i,j}^{(k)} = p^{(k)}$
for $i \ne j$, $i = 1,2,3$, $k = 1,2,3$. Clearly $P_{i,i}^{(k)} = 1 - 2p^{(k)}$. This implies that the
two probabilities of misclassification of an object from class $\omega_i$ into $\omega_j$ are
the same for any specified $k$, and equal for each $i$. In this case, without loss
of generality, we can assume that

$$p^{(1)} > p^{(2)} > p^{(3)}.$$


These inequalities imply that, compared in terms of probabilities of correct
classification, the first discriminant is the worst and the third is the best. As
would be expected, the majority rule will not necessarily be the best; for
instance, $D$ classifies all of the following 5 sample points to class $\omega_1$.

{(1,1,1), (2,1,1), (3,1,1), (1,2,1), (1,3,1)}

As for the points (1,1,2) or (1,1,3), they would be classified to class $\omega_1$ if

$$\frac{1-2p^{(1)}}{p^{(1)}} \cdot \frac{1-2p^{(2)}}{p^{(2)}} > \frac{1-2p^{(3)}}{p^{(3)}}$$

and to class $\omega_2$ and $\omega_3$, respectively, if the direction of the inequality is
reversed. Recall that this is exactly the same condition that we have seen in
an earlier section. Similarly we can find points which will be classified to
$\omega_2$ and $\omega_3$. The remaining 6 permutations of (1,2,3) are classified as follows:
(1,2,3) and (2,1,3) are classified such that the associated $X$ belongs to $\omega_3$,
(1,3,2) and (3,1,2) to $\omega_2$, and (2,3,1) and (3,2,1) to class $\omega_1$.
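The two regimes of this case can be explored numerically. The following sketch lists the sample points assigned to class 1 for two illustrative sets of error probabilities, one on each side of the inequality. The values of `agree` and `defer` and the helper names are assumptions made for the example. All $p^{(k)}$ are taken below 1/3.

```python
from itertools import product

def bayes_class(p, outcome):
    """Class (1, 2 or 3) with the largest likelihood for `outcome`.

    p[k] is the common misclassification probability of discriminant k+1, so
    a discriminant is right with probability 1 - 2*p[k]. Ties are not expected
    for the generic values used below.
    """
    def likelihood(j):
        prob = 1.0
        for k, label in enumerate(outcome):
            prob *= (1.0 - 2.0 * p[k]) if label == j else p[k]
        return prob
    return max((1, 2, 3), key=likelihood)

def odds(pk):
    """Ratio (1 - 2p)/p that appears in the inequality of the text."""
    return (1.0 - 2.0 * pk) / pk

def regions(p):
    """Map each class to the list of sample points assigned to it."""
    out = {1: [], 2: [], 3: []}
    for outcome in product((1, 2, 3), repeat=3):
        out[bayes_class(p, outcome)].append(outcome)
    return out

# Illustrative values (assumptions) with p1 > p2 > p3.
agree = (0.25, 0.20, 0.15)   # odds(p1) * odds(p2) > odds(p3)
defer = (0.30, 0.20, 0.05)   # odds(p1) * odds(p2) < odds(p3)

for p in (agree, defer):
    print(p, odds(p[0]) * odds(p[1]) > odds(p[2]), regions(p)[1])
```

When the inequality is reversed, every point assigned to class 1 has a 1 in its third position, so the fused decision simply follows the third discriminant.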
Probabilities of correct classification can be computed easily by summing
over all points which lead to the acceptance of $X \in \omega_1$ when it is truly the
case. For instance, given

$$\frac{1-2p^{(1)}}{p^{(1)}} \cdot \frac{1-2p^{(2)}}{p^{(2)}} < \frac{1-2p^{(3)}}{p^{(3)}},$$

the probability of correct classification of $X \in \omega_1$ is given by

$$\begin{aligned}
&(1-2p^{(1)})(1-2p^{(2)})(1-2p^{(3)}) + 2p^{(1)}(1-2p^{(2)})(1-2p^{(3)}) \\
&\quad + 2(1-2p^{(1)})p^{(2)}(1-2p^{(3)}) + 4p^{(1)}p^{(2)}(1-2p^{(3)}),
\end{aligned}$$

which, as expected, simplifies to $1-2p^{(3)}$, implying that the decision is equivalent
to the third discriminant. The other probabilities can be evaluated in a similar
manner. The other special cases can also be studied.
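The simplification can be verified by factoring. As a short worked step, write $q_k = 1 - 2p^{(k)}$ and $p_k = p^{(k)}$ (shorthand introduced only for this check), so that $q_k + 2p_k = 1$:

```latex
\begin{aligned}
q_1 q_2 q_3 + 2 p_1 q_2 q_3 + 2 q_1 p_2 q_3 + 4 p_1 p_2 q_3
  &= q_3 \left( q_1 q_2 + 2 p_1 q_2 + 2 q_1 p_2 + 4 p_1 p_2 \right) \\
  &= q_3 \, (q_1 + 2 p_1)(q_2 + 2 p_2) \\
  &= q_3 = 1 - 2 p^{(3)} .
\end{aligned}
```

The four terms are exactly the nine sample points whose third component is 1, grouped by whether the first and second discriminants are right or wrong.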

The extension of the above idea to more than 3 classes is straightforward.

The above concepts can also be extended in exactly the same manner to the case
when there are more than 3 independent segments of measurement on each object.

The case when these segments of observations are stochastically dependent is
relatively hard to study, although similar concepts apply in that situation
also.


Finally, we present the formulas which can be employed to evaluate the
differences between the probabilities of the correct decisions for the two
methods under investigation, in the case of multivariate normal distributions.

We consider the simple situation when all of the parameters are assumed known
and the covariance matrix is the same. Suppose for $X \in \omega_i$, $i = 1,2,3$, the mean vector
is $\mu_i$ and (without loss of generality) the covariance matrix is $I$. As before $X$
(and therefore $\mu_i$) is $p$ dimensional and consists of three segments of
$p_1$, $p_2$ and $p_3$ dimensions. In the following we give the formulas for the
probabilities of correct and incorrect classification when $X \in \omega_1$ and the
whole of $X$ is used. The formulas can be used to calculate the associated
probabilities when any of the segments is used, of course, with obvious
modifications. Given this structure it would be straightforward to calculate
the desired difference.

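In this case of the three class problem, the decision to classify $X$ in class
$\omega_1$ is made when

[equation illegible in source]

or equivalently when $(U_1 > 0, V_1 > 0)$ where

[equation illegible in source]

Define $(U_2, V_2)$ and $(U_3, V_3)$ by

[equation illegible in source]

It can be easily shown that $X$ will be classified to class $\omega_2$ if $(U_2 > 0, V_2 > 0)$
and to class $\omega_3$ if $(U_3 > 0, V_3 > 0)$. The distribution of each of these pairs is
bivariate normal with means, variances and covariances given below. It is assumed,
as mentioned earlier, that $X$ is from $\omega_1$ in the following calculations.

[equations illegible in source]

and the correlation coefficient between $U_3$ and $V_3$ is given by

$$\rho_{U_3 V_3} = \frac{(\mu_3 - \mu_1)'(\mu_3 - \mu_2)}{\left[(\mu_3 - \mu_1)'(\mu_3 - \mu_1)\right]^{1/2} \left[(\mu_3 - \mu_2)'(\mu_3 - \mu_2)\right]^{1/2}}$$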
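The displayed definitions of the pairs did not survive in this copy. As a hedged sketch, one standard choice consistent with the surrounding text is the pair of linear discriminant functions of Anderson for equal priors and identity covariance. The symbol $\Delta_{ij}$ is introduced here only for illustration, and the author's own definitions may differ by ordering or by a positive scale factor.

```latex
\begin{aligned}
U_1 &= \left[ X - \tfrac{1}{2}(\mu_1 + \mu_2) \right]' (\mu_1 - \mu_2), &
V_1 &= \left[ X - \tfrac{1}{2}(\mu_1 + \mu_3) \right]' (\mu_1 - \mu_3), \\[4pt]
\Delta_{ij}^2 &= (\mu_i - \mu_j)'(\mu_i - \mu_j), \\[4pt]
E(U_1 \mid \omega_1) &= \tfrac{1}{2}\Delta_{12}^2, &
\operatorname{Var}(U_1) &= \Delta_{12}^2, \\
E(V_1 \mid \omega_1) &= \tfrac{1}{2}\Delta_{13}^2, &
\operatorname{Var}(V_1) &= \Delta_{13}^2, \\[4pt]
\operatorname{Cov}(U_1, V_1) &= (\mu_1 - \mu_2)'(\mu_1 - \mu_3), &
\rho_{U_1 V_1} &= \frac{(\mu_1 - \mu_2)'(\mu_1 - \mu_3)}{\Delta_{12}\,\Delta_{13}} .
\end{aligned}
```

Under this choice, standardizing $U_1$ and $V_1$ turns the event $(U_1 > 0, V_1 > 0)$ into a standard bivariate normal exceeding $-\tfrac{1}{2}\Delta_{12}$ and $-\tfrac{1}{2}\Delta_{13}$, and the correlation is the cosine of the angle between $\mu_1 - \mu_2$ and $\mu_1 - \mu_3$.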

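Given the above distributions, it is easy to calculate that the probability
of classifying $X \in \omega_1$ to $\omega_1$ is

$$P(U_1 > 0, V_1 > 0 \mid \omega_1) = \int_{-h_1}^{\infty} \int_{-h_2}^{\infty} \frac{1}{2\pi\sqrt{1 - \rho_{U_1 V_1}^2}} \exp\left\{ -\frac{w^2 + s^2 - 2\rho_{U_1 V_1} w s}{2(1 - \rho_{U_1 V_1}^2)} \right\} ds \, dw = L(-h_1, -h_2; \rho_{U_1 V_1}),$$

following the notation of Johnson and Kotz [see equation 19, page 94, [3]]. Here

$$h_1 = \tfrac{1}{2}\left[(\mu_1 - \mu_2)'(\mu_1 - \mu_2)\right]^{1/2}, \qquad h_2 = \tfrac{1}{2}\left[(\mu_1 - \mu_3)'(\mu_1 - \mu_3)\right]^{1/2}.$$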
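Where tables are not at hand, the function $L(h, k; \rho) = P(Z_1 > h, Z_2 > k)$ can be evaluated numerically. The sketch below is one possible approach, not the procedure of the text: it conditions on $Z_1$ and integrates by Simpson's rule using only the Python standard library. It assumes identity covariance and takes $\rho$ to be the cosine of the angle between $\mu_1 - \mu_2$ and $\mu_1 - \mu_3$. The function names and the example means are hypothetical.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def L(h, k, rho, upper=10.0, n=4000):
    """P(Z1 > h, Z2 > k) for a standard bivariate normal with correlation rho.

    Conditions on Z1 = x, where Z2 ~ N(rho * x, 1 - rho**2), and integrates
    over x from h to `upper` with Simpson's rule (n must be even). Requires
    |rho| < 1; the mass beyond `upper` is negligible for upper = 10.
    """
    s = math.sqrt(1.0 - rho * rho)
    f = lambda x: phi(x) * Phi((rho * x - k) / s)
    step = (upper - h) / n
    total = f(h) + f(upper)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(h + i * step)
    return total * step / 3.0

def prob_correct(mu1, mu2, mu3):
    """P(U1 > 0, V1 > 0 | class 1), assuming identity covariance."""
    a = [x - y for x, y in zip(mu1, mu2)]
    b = [x - y for x, y in zip(mu1, mu3)]
    d12 = math.sqrt(sum(x * x for x in a))
    d13 = math.sqrt(sum(x * x for x in b))
    # Assumption: rho is the cosine of the angle between the mean differences.
    rho = sum(x * y for x, y in zip(a, b)) / (d12 * d13)
    return L(-0.5 * d12, -0.5 * d13, rho)

# Illustrative means (assumptions, not from the source).
print(prob_correct([0.0, 0.0], [2.0, 0.0], [0.0, 2.0]))
```

Two easy checks are available: for $\rho = 0$ the result factors into a product of two univariate normal probabilities, and $L(0, 0; \rho) = 1/4 + \arcsin(\rho)/(2\pi)$.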

Since the arguments $-h_1, -h_2$ in $L(-h_1, -h_2; \rho_{U_1 V_1})$ are negative, we have to use
equation 22.3 of page 95 of Johnson and Kotz. At this stage, to calculate
$P(U_1 > 0, V_1 > 0 \mid \omega_1)$ it remains to look up the required values in a table, for example
in the National Bureau of Standards publication (1959). In exactly a similar manner
one could calculate

$$P(U_2 > 0, V_2 > 0 \mid \omega_1) \quad \text{and} \quad P(U_3 > 0, V_3 > 0 \mid \omega_1),$$

which provide the probabilities of misclassification. It is significant to
observe that the above probabilities depend not only upon the Mahalanobis distances

$$(\mu_i - \mu_j)'(\mu_i - \mu_j)$$

but also upon the angles between $(\mu_1 - \mu_2)$, $(\mu_1 - \mu_3)$, $(\mu_2 - \mu_3)$,
etc.